VLSI BASED MULTIPROCESSOR COMMUNICATIONS NETWORKS
Progress  Report:  Year  1 
Mark  A.  Franklin  and  Donald  F.  Wann 

1.  Introduction 

This  document  is  the  Annual  Progress  Report  for  the  Office  of  Naval 
Research contract number N00014-80-C-0761 (NR#: 375-033), entitled "VLSI Based
Multiprocessor Communications Networks". The contract began on September 1,
1980 and was approved on scientific/technical grounds for a duration of three
years.  Incremental  funding  was  approved  for  year  two  of  the  research  and 
this  work  has  just  begun.  The  bulk  of  this  report  documents  research  progress 
and major achievements during the first funding year of the contract. In addition,
research plans for the coming year are reviewed.

The research undertaken has been principally concerned with high bandwidth
communications networks suitable for use in multiprocessor computer
systems.  The  goals  have  been  to  study  the  impact  of  VLSI  technology  on  the 
design  of  such  networks,  and  to  advance  the  development  of  design  methodologies 
oriented  to  this  technology  and  application  domain. 

The need for such an effort and the rationale for the approach taken
were discussed in the original proposal. Briefly, the research is motivated
by  three  factors.  The  first  relates  to  current  computer  architecture  efforts 
at  achieving  very  high  computational  power  and  reliability.  As  technology 
presses the physical limits of component performance (1,2), large increases in
computational power will become achievable principally through the exploitation
of parallel numerical methods (3-8) implemented on appropriate parallel
(multiple processor) computer systems (9-15). Such tightly coupled computer
systems (perhaps made up of large numbers of low cost microprocessors) inevitably
require high bandwidth communications networks for exchange of information
and sharing of memory. Thus the need for such networks is present and
growing (16-20). This is explored in a somewhat wider context in Appendix V.
Note that the need for the sort of high computational power referred to arises in
numerous advanced Navy applications ranging from complex pattern recognition and
signal processing problems to on-line missile defense and tracking problems.

The  second  factor  relates  to  the  costs  and  performance  associated  with 
such  networks  when  embedded  in  large  (hundreds  to  thousands  of  microprocessors) 
computer  systems.  A  poorly  designed  network  can  rapidly  become  a  performance 
bottleneck as the number of processors and traffic in the system increases.
Very high bandwidth networks, on the other hand, can be extremely costly, and
may indeed have a cost which grows faster than the number of processors in
the system (e.g. the number of processors may grow as O(N) while the network
cost may grow as O(N log N)). Thus for large systems, the cost of the network may
dominate  the  cost  of  the  overall  multiprocessor  system,  and  the  performance 
of  the  network  may  be  the  critical  factor  in  overall  systems  performance. 

This  leads  to  the  third  and  final  point  to  be  considered,  the  role  of 
VLSI.  VLSI  technology  opens  up  a  new,  and  largely  unexplored,  design  domain 
which  possesses  certain  promising  properties  and  interesting  problems  when 
applied  to  network  design.  There  is  great  potential  here  for  reducing  the 
costs while maintaining or enhancing the performance of interconnection networks.
The complexity and intelligence of the switching nodes, combined with
the regular topologies associated with such networks, seem to make them well
suited to the requirements for successful VLSI design (i.e. large amounts of logic
needed in a step and repeat manner). On the other hand, network nonplanarity,
pin limitations, network partitioning, and synchronization
represent problems which must be overcome before well-designed VLSI networks
can be achieved. This contract is concerned with the exploration of these
questions.

The  major  accomplishments  of  the  research  thus  far  are  summarized  in 
Section 2 to follow. These accomplishments can be divided into four parts.
The first relates to developing formal graph models which permit characterization
of general interconnection networks. The second concerns application
of these models, in conjunction with a general descriptive model of VLSI components,
to determine space (area of the fabricated chip) and time (delay in
communication  through  the  chip)  bounds  of  various  networks  when  implemented 
in  the  VLSI  technology  so  that  pin  constraints  of  the  packaged  chip  can  be 
satisfied.  The  third  deals  with  how  such  VLSI  based  interconnection  networks 
can  be  partitioned  in  an  optimum  manner.  The  fourth  reports  on  an  important 
synchronization  problem  which  arises  in  partitioned  networks  and  considers 
certain  design  techniques  for  overcoming  this  problem. 

Section  3  summarizes  our  plans  for  year  two  of  the  contract.  Basically 
the outline discussed in the original proposal will be maintained. This includes
further work in the research areas discussed above. Section 4 concludes
with  a  summary  discussion  of  the  research  thus  far. 

A  number  of  appendices  follow  the  main  body  of  the  report.  These  include 
research  papers  which  have  been  published  or  submitted  for  publication  during 
the first contract year, and several working papers which discuss research in
progress.

2. Research Summary and Major Accomplishments
2.1  Formal  Models 

One  of  the  first  research  tasks  undertaken  concerned  determining  what 
effect  differing  interconnection  topologies  have  on  such  VLSI  oriented 
measures  as  chip  area  and  delay.  Note  that  traditional  network  measures 
of  complexity  have  almost  always  centered  on  costs  determined  by  switch 
counts,  and  delays  based  on  aggregate  mean  path  switch  delays.  Such  measures 
indeed  make  sense  in  an  environment  where  either  discrete  electrical  or 
electromechanical  devices  are  utilized.  When  considering  placing  networks 
on  VLSI  chips,  however,  the  situation  changes  substantially.  Within  a  chip 
networks often tend to be connection intensive rather than component intensive.
That is, an extensive area on the chip is occupied with connections
between  components  rather  than  with  the  components  themselves.  Furthermore, 
the  delays  associated  with  signal  propagation  along  these  connections  can  be 
an  important  component  in  the  overall  delay. 

These  lines  of  inquiry  were  initially  pursued  in  a  comparison  study  of 
Banyan  and  crossbar  networks  (21).  In  that  work,  a  constructive  approach 
was  used  to  determine  the  area  and  delay  requirements  of  these  two  networks 
when  implemented  on  a  single  chip.  The  technical  approach  used  associated 
a  particular  chip  layout  with  each  network.  This  layout  was  posed  as  being 
near optimal. The layouts reflected the topology of the networks, and
various geometric and physical electronics arguments were presented in developing
overall area and delay expressions. While this work was successful,
it clearly demonstrated the need for a more comprehensive study of the problem.
That is, a more general approach was needed which was not so tied to
particular  networks  or  layout  assumptions. 


Research was pursued in this area over the past year. First a
general graphical approach to specifying the topological properties of
interconnection schemes was developed and tested on a variety of networks.
While reviewing the principal interconnection networks in terms of their
graph specifications, a characterization of these networks from both blocking
and traditional complexity viewpoints was undertaken. The blocking properties
of a network relate to how connections and components in the network are
shared by different input/output paths, and are important factors in determining
the bandwidth of the network. This work is reported in Appendix I.

While a graph model may be used to specify network topology, there
remain the problems of specifying the VLSI components onto which the network
is mapped, and the procedure to be followed in the mapping process. With
regard to the first problem, one would like a reasonably high level approach
to specifying components which avoids the issues of detailed electronic design
while retaining realistic performance properties. In the method adopted, components
(i.e. gates) are modelled in terms of parameters describing their
active channel areas, fixed pullup to pulldown ratios, device sizes related
to minimum feature sizes, and bounding boxes which can be readily mapped onto a
chip area grid. Time in this model is associated with the time required
for a minimum size transistor to drive the capacitance of a unit length line.
A full description of the model is not pursued here; however, work is proceeding
(22) and will be reported in a subsequent paper. The second problem,
that of describing the VLSI mapping procedure, will be considered in Section 2.2.

2.2 Space-Time Performance Bounds

There  are  three  principal  approaches  to  the  mapping  problem  discussed 
in  the  prior  section.  The  first  is  to  perform  detailed  chip  designs  and 
layouts  for  the  networks  of  interest  and  from  these  designs  determine  the 
space and time performance measures in question. This is clearly
very time consuming, and perhaps impossible when more than a few networks
are involved. More importantly, a clear comparison between networks
is very difficult since uniform design standards are essentially nonexistent.
In this situation differences in design details may have an important effect
on results. The second approach considers plausible layouts of networks where
individual network nodes are assumed to be rectangular regions of known area.
Detailed logic design is avoided with this approach and order expressions
(e.g. O(N log N)) for the performance measures of interest can be derived (21).
While  this  intermediate  level  of  analysis  can  yield  fairly  realistic  expressions 
for  area  and  delay  it  still  relies  on  individual  design  judgments  to  determine 
reasonable  node  and  connection  layouts. 

The final approach considers networks as abstract computation graphs and
attempts to develop lower bound expressions for arbitrary networks based on
topological properties of their graphs and some knowledge of the information
flow through the graphs. Thompson (23) pioneered this area (now referred
to as VLSI complexity theory) by combining certain graph theoretical results
with a general VLSI component and time model. His main result is that the
area occupied by the wires and nodes of a VLSI design that corresponds to a
graph with minimum bisection width w is greater than w^2/4. Informally,
the minimum bisection width of a graph is the smallest number of edges that
must be removed to disconnect one half of the vertices of a graph from the
other half. While being asymptotically tight, this bound is in many cases
weak. The second half of Thompson's model has to do with time, and though
important, will not be pursued here.
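
The definition of minimum bisection width given above can be made concrete with
a small sketch. The code below is an illustration only and is not taken from
Thompson's work or from this report: it finds w by exhaustive search, which is
practical only for very small graphs, and then evaluates the quoted w^2/4 area
bound. The function names and the two example graphs (an 8-node ring and a
3-dimensional binary cube) are assumptions chosen for the example.

```python
from itertools import combinations


def min_bisection_width(num_vertices, edges):
    """Brute-force the minimum bisection width of a small undirected graph.

    Vertices are numbered 0..num_vertices-1 and num_vertices must be even.
    """
    if num_vertices % 2:
        raise ValueError("this sketch assumes an even number of vertices")
    best = len(edges)  # no cut can remove more edges than the graph has
    # Keep vertex 0 in the first half so each bisection is examined once.
    for rest in combinations(range(1, num_vertices), num_vertices // 2 - 1):
        half = {0, *rest}
        cut = sum(1 for u, v in edges if (u in half) != (v in half))
        best = min(best, cut)
    return best


def thompson_area_bound(w):
    """The w^2/4 area bound quoted in the text, in unit areas of the model."""
    return w * w / 4


def ring(n):
    """Edges of an n-node ring (hypothetical example graph)."""
    return [(i, (i + 1) % n) for i in range(n)]


def hypercube(dim):
    """Edges of a binary cube of the given dimension (hypothetical example)."""
    n = 1 << dim
    return [(v, v ^ (1 << b)) for v in range(n) for b in range(dim)
            if v < v ^ (1 << b)]


if __name__ == "__main__":
    for name, n, edges in [("ring-8", 8, ring(8)), ("3-cube", 8, hypercube(3))]:
        w = min_bisection_width(n, edges)
        print(name, "w =", w, "area bound =", thompson_area_bound(w))
```

Exhaustive search is used here only because it mirrors the informal definition
directly; it says nothing about how bisection widths are obtained for the large
networks considered in the report.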

There  are  two  important  drawbacks  to  Thompson's  development.  The  first 
relates  to  the  VLSI  area  model  he  used  in  formulating  the  area  bounds.  A 
key assumption made is that the area assigned to a node is independent of the
logic inside it and varies as the square of the number of input and output lines
associated with the node. Thus a node with 4 input and output lines is
assigned an area of 16 units irrespective of the internal logic of the node.
The second drawback concerns the calculation of time delays. The model takes
no account of delay within nodes and assumes unit delay across wires (by assuming
logarithmically staged line drivers matched to each line). The result is
that the time bounds established are very weak.

The  goal  of  our  research  in  this  area  has  been  to  preserve  the  spirit 
of Thompson's approach while making the model more realistic. That is, we aim to
preserve the graph theoretic orientation which permits uniform application
of  the  model  over  all  graphically  defined  networks,  but  to  provide  for  more 
appropriate  area  and  time  expressions  and  thus  obtain  tighter  lower  bounds  on 
network  area.  To  do  this  a  number  of  techniques  have  been  used  to  determine 
node  area. 

The first technique considers establishing node areas based on fan-in/fan-out
considerations. These results are not significantly better than
Thompson's bound but provide some insight into various node area arguments.
In particular, the bounds can be applied where logic within a node is simple,
but the node has a large degree (e.g. a parallel in/parallel out shift
register). The second technique demonstrates how node areas
can be defined in a recursive manner based on recursive
definitions of network computation graphs. The third technique applies in situations
where the nodes are defined only in terms of their functional capabilities,
not in terms of given logic implementations. In these cases a finite state
machine model of a node has been proposed, and a clocked PLA (Programmable
Logic Array) implementation developed. Lower bound expressions for the area
required by the machine can be obtained and used as a lower bound on node area.
Details of this work are now being prepared. The major results of our area
studies applied to interconnection networks are summarized in the table below.
In almost all cases the lower bounds obtained are significantly higher than those
obtained using traditional component count measures.


Network                   Blocking        Area            Control
                          Category        (lower bound)   Included?

Crosspoint Switch         Nonblocking     N**4            Yes
Mesh Connected Crossbar   Nonblocking     N**2            Yes
Clos Network              Nonblocking     N**(5/2)        No
Delta Network             Blocking        N**2            Yes
ORAN                      Rearrangeable   N**2            No
Batcher                   Rearrangeable   N**2            Yes

Interconnection Networks in VLSI
(N = Number of Ports)
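
To get a feel for what the exponents in the table imply, the following sketch
evaluates how each area lower bound scales when the number of ports grows. It
is an illustration only: the exponents are copied from the table, constant
factors are unknown and therefore omitted, and the function names are
assumptions made for this example.

```python
# Exponents copied from the table: the area lower bound grows as N**exponent.
AREA_EXPONENT = {
    "Crosspoint Switch": 4.0,
    "Mesh Connected Crossbar": 2.0,
    "Clos Network": 2.5,
    "Delta Network": 2.0,
    "ORAN": 2.0,
    "Batcher": 2.0,
}


def area_growth(network, n_ports):
    """Return N**exponent. Constants are omitted, so only ratios are meaningful."""
    return float(n_ports) ** AREA_EXPONENT[network]


def relative_growth(network, n_small, n_large):
    """Factor by which the bound grows when ports go from n_small to n_large."""
    return area_growth(network, n_large) / area_growth(network, n_small)


if __name__ == "__main__":
    for name in AREA_EXPONENT:
        print(name, relative_growth(name, 16, 32))
```

Because only ratios are computed, the sketch compares growth rates and should
not be read as giving absolute chip areas for any of the networks.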

Research on establishing tighter time bounds has been pursued for
those computation graphs with loop-free input-output paths. Most interconnection
networks of interest fall in this category. The initial
assumption has been made that the system operates in a synchronous manner
with a two phase clock where the active phases of the clock are used either
for capturing the input data at nodes or for validating output data at
nodes. Inter-clock periods are used either for computation within nodes
or transfer of information between nodes. Based on the PLA structure of
the finite state machine and the topology of the network, bounds on the
node time and the data transfer time can be obtained. In conjunction with
the area bounds discussed above, space-time bounds can be derived. The
time bound derivations here are complex and are still being developed. A
written report on this material will be forthcoming. In the sections to
follow we move from the individual chip domain to the systems domain, where
we investigate the problems which arise when large networks require
collections of chips.

2.3 Partitioning and Pin Limitations

Once a fuller understanding of the impact of network topology, layout and
general VLSI design constraints has been achieved at the single chip level, it
is important to examine the problems associated with large networks requiring
many chips. The problems of chip interconnections, pin limitations and network
partitioning have often been omitted from the formal modelling process. These
questions seem to fall in the domain of "packaging" problems and related esoterica,
and as such have been traditionally neglected by the research communities interested
in characterization of formal design processes and methodologies.
Indeed the research discussed here is unique and the publication found in
Appendix III is one of the few reported efforts in the area.

The principal problem can be seen from a simple example. Consider a
network with N' input ports and M' output ports, with each port being B' bits
in width. Pick N', M' and B' to be 12, 12 and 16 so that the logic required
for implementation will have little difficulty fitting on a single VLSI chip.
To support this chip at least B'(N'+M') or 384 pins would be required. This is
far more than common commercially available integrated circuit carriers provide.
Given that pins are typically placed on 100 mil centers along the periphery
of the package, the total number of pins is limited mainly by the increase in
the physical length of the package, which in this example would require a
19.2 inch dual-in-line package.
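
The arithmetic in this example can be checked with a few lines of code. The
sketch below is illustrative only: the function names are assumptions, the pin
count covers data pins alone (power, ground and control pins are ignored, which
is why the text says "at least"), and the package is assumed to carry its pins
in two equal rows.

```python
MILS_PER_INCH = 1000


def data_pins(n_in, n_out, bits):
    """Data pins for a single-chip network: one pin per bit per port."""
    return bits * (n_in + n_out)


def dip_length_inches(pins, pitch_mils=100):
    """Length of a dual-in-line package with the pins split over two rows."""
    pins_per_row = (pins + 1) // 2
    return pins_per_row * pitch_mils / MILS_PER_INCH


if __name__ == "__main__":
    pins = data_pins(12, 12, 16)  # N' = M' = 12, B' = 16 as in the text
    print(pins, dip_length_inches(pins))
```

With N' = M' = 12 and B' = 16 the functions give the 384 pins and the
19.2 inch package length quoted above.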

Another way of looking at this is in terms of the number of pins, P_in,
typically required when implementing a logic function requiring C circuits.
This has been found to be reasonably approximated by (2):

     P_in = K C^b,    b = 0.5

One would expect a function of this form since the number of circuits on a chip
varies  directly  with  the  chip  area,  while  the  number  of  pins  varies  directly 
with  the  chip  perimeter.  A  straightforward  set  of  calculations  indicates  that 
for  interconnection  networks  of  the  sort  considered  the  value  of  K  is  more 
than  an  order  of  magnitude  greater  than  would  be  expected  from  circuit  number 
considerations  alone. 
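
The pin relation can be turned around to estimate K from a known pin count and
circuit count. The sketch below is a hedged illustration: the function names
are assumptions, and the circuit count used in the example is invented purely
to exercise the formula, so the K it produces should not be read as a result
of this research.

```python
def pins_required(circuits, k, b=0.5):
    """Pin estimate from the relation above: pins = K * C**b."""
    return k * circuits ** b


def implied_k(pins, circuits, b=0.5):
    """Solve pins = K * C**b for K, given an observed pin count."""
    return pins / circuits ** b


if __name__ == "__main__":
    # Hypothetical input: the 384-pin chip of the earlier example, with an
    # invented count of 10,000 circuits on the chip.
    print(implied_k(384, 10_000))
```

Comparing the K implied by a network chip with the K observed for general
logic of the same circuit count is one way to reproduce the kind of comparison
described in the text.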

Given the pin constraints imposed by standard packaging technologies,
the problem is how to partition a large network optimally
so that pin constraints are satisfied and a given performance measure (e.g.
chip count, network delay, count-delay product) is minimized. This is the
problem attacked in the paper given in Appendix III.

Two  partitioning  strategies  are  introduced,  and  expressions  for  obtaining 
the  optimum  chip  size  and  chip  data  path  width  are  developed  for  two  different 
network types, and two standard network control schemes (synchronous and
asynchronous). These expressions are parameterized on the basis of chip pin
constraints  and  a  variety  of  VLSI  component  parameters.  They  can  be  readily 
used  as  part  of  a  larger  design  study. 

One  key  conclusion  is  that  while  single  bit  per  slice  partitioning  is 
often optimum when dealing with incremental or mesh connected crossbar networks,
it  is  generally  not  optimum  when  N  log  N  networks  (e.g.  Banyan,  Omega,  Inverse 
Binary  N-Cube)  are  used.  For  example  when  implementing  a  Banyan  network  of  size 
N'=M'=512  and  B'=16,  if  network  delay  is  the  performance  measure,  and  the  chip 
is  limited  to  90  pins,  then  selecting  a  single  bit  per  slice  rather  than  the 
optimum  eight  bits  per  slice  will  result  in  about  a  factor  of  four  increase  in 
delay. 

A  full  discussion  of  this  material  is  presented  in  Appendix  III  which  was 
presented  at  the  1981  International  Symposium  on  Parallel  Processing  and 
published  in  the  symposium  proceedings. 

2.4 Synchronization and Partitioned Networks

All  communication  networks,  whether  partitioned  or  not,  must  have 
some  type  of  control  that  a)  allows  the  network  paths  to  be  established, 
and  b)  permits  the  orderly  transfer  of  data  from  a  source  to  a  destination 
port.  For  very  small  networks  that  do  not  require  partitioning  and  thus 
can  be  placed  on  a  single  VLSI  chip,  it  is  possible  to  design  an  efficient 
control  mechanism  that  has  a  central  physical  location  and  which  transmits 
its  control  information  over  special  paths  to  the  various  switching  and 
transmission  elements.  Once  the  network  size  increases  (either  by  adding 
additional  ports  or  by  increasing  the  number  of  bits  per  port)  to  a  point 
where partitioning is necessary, modularity of the design becomes very important
and the use of a central control structure places severe limitations
on  the  network  designer  and  on  the  network  performance.  This  is  a  result 
of  two  factors:  1)  with  modular  networks  the  size  of  the  control  structure 
is  usually  unknown  at  design  time,  thus  it  may  be  necessary  to  include  a 
large  control  structure  even  though  it  is  not  often  fully  utilized  (e.g. 
control  for  1000  ports,  even  if  only  50  ports  are  used)  and  2)  the  large 
size  of  the  partitioned  network  may  result  in  physically  long  paths  from 
the  central  control  to  the  individual  switching  elements,  thus  requiring 
excessively  long  times  for  path  establishment  and  data  transfer.  In  fact, 
the longest path, and thus the longest time, often dictates the overall network
performance.

We  have  made  a  relatively  thorough  investigation  of  these  issues 
and have explored how one might employ a modular switching element combined
with a distributed self-timed control structure for such partitioned
communication networks. The detailed results of this investigation are
presented in Appendix IV, where we show the specification for a self-timed
switching module that uses local, distributed control and that can be used in
a modular manner to create networks of arbitrary size. Since these modules
operate by responding to sequences of signals (i.e. signal occurrences must be
ordered) rather than requiring signals to be separated by certain time intervals
(i.e. signals must precede other signals by x nanoseconds) it is not
necessary  for  the  designer  to  redo  or  readjust  the  logical  design  (and  hence 
the  VLSI  implementation)  as  the  network  size  changes.  Thus  they  offer  some 
significant  advantages  if  large,  variable  size  networks  are  to  be  constructed. 

One unexpected "synchronization" problem had to be overcome in the development
of this class of network interconnection. The problem appears only when
the bits in a word are partitioned into separate individually controlled bit
planes. As mentioned in Section 2.3, in many situations the optimum partition
occurs with small numbers of bits per plane and in certain cases is produced
with one bit per plane. To understand the synchronization problem, consider
a system having many ports, each with an 8 bit word, where each source and
destination bit interconnection is achieved on its own bit plane (e.g. 8 planes).
Notice that if two sources, say S1 and S2, are concurrently trying to establish
a path to the same destination, say D, it is possible that paths from S1 to D
will be established on some planes and paths from S2 to D will be established
on other planes. This is a direct consequence of distribution of control: each
switch element operates in an autonomous mode (desirable) but operates without
a knowledge of how the other bits in the word are being treated on the other
planes (undesirable). Thus it is possible for the destination to receive a
mixture of bits from these two sources; we call this a nonhomogeneous or
inconsistent word. Notice that this only occurs during the path establishment
phase; once the path is established, all bits are transmitted properly.
Fortunately, we have been able to develop a simple procedure that allows us to
detect when an inconsistent word occurs during path establishment and this
can be used to generate a path retry request for that port. We have also
made a theoretical analysis to determine how often such a problem will arise
(as a function of certain network parameters, such as switch element delay,
request rates, etc.) and have found that the probability of generating such
an inconsistent path is normally quite small (e.g. p_retry < 0.07). Finally
we point out that central control does have one advantage over distributed
control: it requires fewer pins. We currently are investigating this issue
and are considering modules that are still locally controlled but that are
"nearly" self-timed.

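The inconsistent word condition described above can be illustrated with a short
sketch. This is not the detection procedure developed under the contract, which
is not described in this section. It assumes, purely for illustration, that the
destination can observe which source was granted the path on each bit plane;
the function names and the source labels are likewise hypothetical.

```python
def is_inconsistent(grants):
    """True when the bit planes did not all grant the path to the same source.

    grants holds one source label per bit plane, as seen at one destination.
    """
    return len(set(grants)) > 1


def sources_to_retry(grants):
    """Sources holding some but not all planes must release them and retry."""
    sources = set(grants)
    return sorted(sources) if len(sources) > 1 else []


if __name__ == "__main__":
    # Eight bit planes; S1 won on six planes and S2 on two (assumed outcome).
    grants = ["S1", "S1", "S2", "S1", "S1", "S1", "S2", "S1"]
    print(is_inconsistent(grants), sources_to_retry(grants))
```

In this sketch both contending sources are asked to retry; other policies, such
as letting the source that holds the most planes keep them, are equally
plausible and would need to be checked against the design in Appendix IV.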

Research  into  quantifying  the  area  and  delay  properties  of  various 
network types when laid out on a VLSI chip will continue. Some of this
work, as described in sections 2.2 and 2.3, is near completion and documentation
of the results is now underway. The models developed will be
used  in  part  to  investigate  the  properties  of  different  types  of  crossbar 
switch  design. 

Research on the effects of VLSI chip pin constraints on network partitioning
will be extended to consider the impact of data pipelining and
path  blocking.  Computer  based  modeling  and  simulation  studies  will  be 
undertaken  in  this  area  with  the  goal  of  obtaining  partitioning  strategies 
which  are  near  optimum  over  a  wide  range  of  network  sizes  and  chip  pin 
constraints.  The  possibilities  of  developing  designs  for  a  switch  chip 
set  which  would  be  suited  to  a  variety  of  network  needs  and  sizes  will 
be  examined. 

The  problem  of  quantifying  the  impact  of  physical  constraints  (i.e. 
component,  pin,  power  densities)  at  the  chip,  board  and  rack  levels  on  the 
partitioning  of  VLSI  oriented  chip  arrays  will  be  explored.  While  the 
networks  to  be  considered  will  primarily  be  interconnection  networks,  other 
networks  more  oriented  towards  special  purpose  computation  will  also  be 
examined  as  time  permits. 

Studies  on  the  synchronization  of  bit  sliced,  plane  partitioned  networks 
will  be  completed  and  preliminary  research  on  the  effects  of  centralized 
versus  decentralized  control  of  networks  will  be  extended.  The  modularity, 
growth  and  reliability  properties  of  centralized  and  decentralized  control 
schemes will be explored. Based on the results of the above research, work
will begin on the physical implementation of a VLSI network chip. This
will act as a testbed for a number of the research ideas developed.

4.  Conclusions 

This  annual  report  has  documented  research  progress  and  achievements 
which  have  occurred  during  year  one  of  ONR  contract  N00014-80-C-0761 
entitled  "VLSI  Based  Multiprocessor  Communications  Networks".  The  work 
was  performed  at  the  Washington  University  Center  for  Computer  Systems 
Design,  St.  Louis,  Missouri.  This  work  has  been  motivated  by  the  potential 
for  increased  speed  and  high  reliability  associated  with  the  implementation 
of multiple processor systems, recognition of the importance of the interconnection
network over which the processors communicate, and by the availability
of new design options afforded by the ongoing VLSI technology revolution.

A  great  deal  of  progress  has  been  made  on  the  basic  research  tasks 
outlined in the original proposal. These accomplishments span the development
of formal network and VLSI based models; the establishment of space
and time performance bounds for various networks when implemented on a single
chip; the study and modelling of the partitioning of large networks requiring
many chips and having severe pin limitation constraints; the analysis of
an important synchronization problem which occurs in partitioned networks;
and the preliminary functional design of a switch component which solves
the synchronization problem while maintaining network modularity and distributed
control.

Proposed  research  during  year  two  is  outlined  in  Section  3  of  this 
report. Work will continue on tasks begun during the first year. Work
will be initiated on a number of new tasks including the general modelling
of constraints such as pin limitations over several levels of physical design;
the design of network chips which are optimum over a wide range of
network sizes and chip pin constraints; the study of centralized versus
decentralized design techniques; the effect of pipelining on various aspects
of network design; the further study of synchronization problems which
arise in VLSI network design; and the preliminary design of a VLSI network chip
or chip set.

It is our hope that the research momentum of the first year will continue
and that significant progress will be made during year two of the
contract. 

Interconnection Networks
A  Model  and  Survey 

Krishnan  Padmanabhan 

1.0  Introduction 

Due to LSI and VLSI technologies there now exists an abundance of low
cost  digital  logic  and  microprocessor  components.  These  components  may 
have  substantial  logical  complexity,  and  are  rapidly  becoming  the  basic 
building  blocks  in  the  design  of  more  powerful  computers  and  communications 
systems. In the computer area, high computational power appears achievable
through the development of multiple processor systems (1,2,3,4). Central
to  such  multiple  processor  systems  is  the  connection  network  over  which 
the  processors  exchange  information.  A  poorly  designed  network  can  rapidly 
become  a  performance  bottleneck  as  the  number  of  processors  and  traffic  in 
the  system  increases.  Very  high  bandwidth  networks,  on  the  other  hand,  can 
be  extremely  costly,  and  may  indeed  have  a  cost  which  grows  faster  than  the 
growth  of  processors  in  the  systems.  Under  these  conditions,  the  cost  of 
the  network  may  dominate  the  cost  of  the  overall  multiprocessor  system. 

This surging interest in multiprocessor systems has led to a renewed
interest in the design of interconnection networks. Most of the earlier
work on networks, particularly nonblocking networks, was carried out
in the field of telephone switching. This is not surprising, for telephone
exchanges were the most complex digital systems in existence for quite
some time. Fortunately, most of the theoretical work done against
this background can be transferred to a discussion of the issues involved in
networks of computers. The reason for this, in large part, is that
the complexity of a switching network (a term that itself needs
elaboration) depends mainly on the combinatorial and topological
properties of the network, rather than on factors related to the application.
The latter considerations are nevertheless important when choosing
a specific network.

Two  principal  modes  of  network  communications  can  be  identified.  The 
first  is  referred  to  as  the  circuit  switched  mode,  and  has  been  the  dominant 
one  used  in  telephone  switching.  In  this  mode  connections  are  provided 
between  a  requesting  set  of  inputs  and  the  desired  outputs  by  opening  and 
closing  switches  or  crosspoints.  A  path  that  is  thus  set  up  between  an 
input  and  an  output  is  "held"  until  the  transaction  is  completed.  Being 
held,  no  other  transactions  can  use  that  path,  and  such  transactions  are 
thus  blocked  if  they  require  that  path,  or  any  part  of  it.  The  second 
communications mode is called the packet switched mode. In this mode
the input, or sender, puts the information
to be sent to the receiver in a packet which contains the address of the
destination. The interconnection network routes this packet to its destination
in a manner that depends on the current network status. The packet itself
moves through the network "holding" only a part of the path it traverses
at any point in time. The efficiency of this scheme derives from two
sources. First, by holding only part of the path during transmission, a
smaller part of the network is tied up by the ongoing transaction. Other
transactions can use that part of the packet's path which has been released
as the packet moves through the network. Second, there is generally no
unique route between input and output. A requesting unit therefore need not
be tied down to the availability of a single complete path to initiate
communication. These are clearly advantages of the packet mode over the
circuit switched mode of communication. However, there are a number of
disadvantages, primarily with respect to higher logic and hardware costs.
Furthermore, there are a number of specific computer network applications
where circuit switching is clearly faster than packet mode communications.
It is not the objective of this paper to compare the performance of these
different approaches, and the discussion to follow will be restricted to
networks operating in the circuit switched mode.

The next section presents a general approach to modeling such networks
in terms of graphs and states. This is followed by a brief discussion of
certain issues relating to network complexity. A network classification
scheme is then given, and some results relating to the blocking
characteristics of networks are presented. A more thorough discussion of
selected networks and their complexity follows, and the paper is then
summarized and concluded. The paper itself is a condensation of material
found in reference 20.
2.0 Interconnection Network Models - Graphs and States

To be able to prove certain facts about networks, a formalized notion
of a network is needed. A natural way to define a network is as a directed
graph with nodes representing either an input or output module or a
switching element and each directed edge standing for a physical link in the
network. Choosing a digraph means we are restricting ourselves to
unidirectional communication over a set of paths (which are collectively
represented by an edge). This really does not restrict the model because
a link between two elements that permits bidirectional transmission can be
represented by two oppositely directed edges. It will be shown that this
model is both sound and complete in the sense that there exists a one to
one correspondence between the set of all finite digraphs and the set of
interconnection networks.

A switching network N can be represented as a digraph $G_N(V_N, E_N)$
where $V_N$ is the vertex set and $E_N$ is the edge set.

The vertex set $V_N$ is defined as:

$$V_N = I \cup SE \cup O$$

$I$ is a nonempty set of vertices with each $i \in I$ standing for an input
module in N. $O$ is a nonempty set of vertices with each $o \in O$
representing an output module in N. $SE$ is a switching element (module) set
with each $se \in SE$ standing for an individual switching element in the
network. The switching element is defined as having n inputs and m
outputs ($n, m \ge 1$) with the capability of connecting any input to any output.
Note that if $SE$ is empty, it is not necessary for each of $I$ and $O$ to
be nonempty, but just that $I \cup O \ne \emptyset$. This is the classic case of
the interconnection where the processors communicate with each other directly
through links without any switching elements. In this case it may not be
possible to identify input and output modules (e.g., star connection or
ring).

Given a set of vertices $V_N$, with elements being designated as $v_i$,
it is now necessary to specify how these elements are connected together.
Let $E_N$ be the set of directed edges $\{e_k\}$. The connection between $E_N$
and $V_N$ is defined in terms of an incidence function $\psi$. If an output of
the module corresponding to $v_i$ ($i$, $se$, or $o$) is connected to an input
of the module corresponding to $v_j$ then:

$$\psi(e_k) = v_i v_j, \qquad e_k \in E_N, \quad v_i, v_j \in V_N$$

Some  examples  of  standard  interconnection  structures  and  their  digraph 
equivalents  are  presented  in  Figures  1,2  and  3. 

Network  graphs  defined  in  this  manner  have  a  number  of  fairly  evident 
properties.  These  are  presented  below  without  proof. 

1. If each switching element is of size n x m (i.e., n inputs and
m outputs), then the in-degree of each vertex of the type se
is n and the out-degree of any vertex of the same type is m.

2. In general, the in-degree of a vertex of type i is zero and the
out-degree of a vertex of type o is also zero.

3. $\forall i, \forall o$: out-degree of $i$ = in-degree of $o$ = 1.

4. Any network graph with $I \cap O = \emptyset$ is cycle free. This can be
proved using properties 2 and 3. Also the number of cycles in
the graph is equal to $|I \cap O|$.

5. This representation for a communication network is both sound and
complete: that is, for every network there corresponds a unique
digraph and a digraph with properties 2, 3, and 4 always corresponds
to a 'valid' communications network.

6. In N, input k can 'access' output l if and only if there
exists a (directed) path between $i_k$ and $o_l$ in $G_N$.
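
The digraph model and its degree and access properties can be illustrated
with a short sketch. The Python fragment below is only an illustration under
stated assumptions: the vertex names, the single 2 x 2 switching element, and
the helper functions `in_degree`, `out_degree` and `reachable` are
hypothetical and do not come from the figures of this paper.

```python
from collections import deque

# Hypothetical network: two inputs, two outputs, one 2 x 2 switching element.
I = {"i1", "i2"}
O = {"o1", "o2"}
SE = {"se1"}
V = I | SE | O                      # vertex set: I union SE union O
E = [("i1", "se1"), ("i2", "se1"),  # each directed edge is (tail, head)
     ("se1", "o1"), ("se1", "o2")]


def in_degree(v, edges=E):
    """Number of edges whose head is v."""
    return sum(1 for _, head in edges if head == v)


def out_degree(v, edges=E):
    """Number of edges whose tail is v."""
    return sum(1 for tail, _ in edges if tail == v)


def reachable(source, target, edges=E):
    """Breadth-first search: True if a directed path source -> target exists."""
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, []).append(head)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Property 1: the 2 x 2 element has in-degree 2 and out-degree 2.
# Property 6: input k can access output l iff a directed path exists.
print(in_degree("se1"), out_degree("se1"), reachable("i1", "o2"))
```

An edge list is used here rather than an adjacency matrix only because it
mirrors the definition of the edge set as a set of directed edges.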

The subgraph S of $G_N$ with $V(S) = SE$ will be called the switch graph.
Basically this is the network graph with the input and output nodes (and
their associated edges) removed. More often than not, it is the switch graph
that we are interested in because it is the structure that captures the
intrinsic topological properties of the network. Note that the switch graph
may indeed be the null graph, if the interconnection does not make use of any
switches. However, for our purposes, we will be interested in networks for
which $SE \ne \emptyset$.

The model as presented above is concerned only with the 'static'
properties of the network, i.e., with the topological aspects external to
the switch. A connection is set up between an input unit and an output
unit only when some (generally more than 0) switching elements are
'set' in particular positions. This is a dynamic property of the network
since it depends on what inputs are currently active and to what outputs
they are connected. This information can be specified by associating with
each switch node a table that gives the connections established within that
node. The table need not contain entries corresponding to inputs and
outputs that are not used (not connected).

Where each switching element has two inputs and two outputs, it is possible
to condense the information in the table to just one bit. Call the two
inputs to the switch 1 and 2 and the two outputs 1' and 2'. If 1 is
connected to 1' then the only connection for 2 is 2'. Similarly, if
1-2' is established, then 2 can be connected only to 1'. Thus the
switching element can only be in one of two configurations and hence
the state can be represented by a single bit. A 0 will represent a
'straight' position (1-1', 2-2') while a 1 will represent a 'cross'
position (1-2', 2-1'). This fact can also be seen in another way; basically
the entries in the table specify the permutation of the inputs to the
switching element. If the switching element has n inputs (and n outputs)
there are a total of n! possible entries for the table. For a 2 input -
2 output switch then, there are just 2 permutations, one of which we
represent by a 0 and the other by a 1.
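
The counting argument above can be checked with a small sketch. The function
names `switch_tables` and `two_by_two_table`, and the choice of a Python
dictionary to stand for a switch table, are assumptions made for illustration
only.

```python
from itertools import permutations
from math import factorial


def switch_tables(n):
    """All input-to-output permutations of an n x n switching element."""
    inputs = range(1, n + 1)
    return [dict(zip(inputs, outputs)) for outputs in permutations(inputs)]


STRAIGHT, CROSS = 0, 1


def two_by_two_table(bit):
    """Expand the one-bit state of a 2 x 2 element into its table.

    Keys are the inputs 1 and 2; values are the outputs 1' and 2',
    written here as the integers 1 and 2.
    """
    return {1: 1, 2: 2} if bit == STRAIGHT else {1: 2, 2: 1}


# An n x n element has n! possible tables; for n = 2 one bit suffices.
assert len(switch_tables(3)) == factorial(3)
assert len(switch_tables(2)) == 2
```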

We are now in a position to define the state of a network. The state
is given by the ordered pair $\langle G_N, \{T_{se}\} \rangle$. $G_N$ is the
network graph defined earlier. $\{T_{se}\}$ is a set of tables that specify
the permutation realized by each switching element. The first element of the
state therefore gives the static topological characteristics of the network.
The second element is what basically captures the dynamic properties of the
network. In fact it is possible for us to define the state as just
$\{T_{se}\}$ since $G_N$ is fixed. $G_N$ is however included in the
specification of the state for the sake of completeness. As an example
consider the mesh network of Figure 1. If the input/output paths to be
established are 1-1', 2-3', 3-2', the state of the network is given by $G_N$
and $\{T_{se}\}$ where $G_N$ is specified in Figure 1 and $\{T_{se}\}$ is
given below along with the switch input/output labelling.

A dash indicates a don't care condition.
3.0 Issues of Complexity

The complexity of a system made up of interconnecting a large number
of (atomic) elements has two aspects to it: the number of atomic
components and the way they are interconnected. Pippenger (5)
gives  a  good  account  of  complexity  theory  applied  to  telephone  switching. 
Although  the  complexity  of  such  a  system  is  very  much  greater  than  the 
sum  of  its  parts  (essentially  due  to  the  type  of  interconnection),  researchers 
have,  up  until  very  recently,  been  concerned  just  with  this  sum.  The 
reason  for  this  has  been  two  fold:  (1)  It  is  much  easier  to  analyze  the 
complexity  in  terms  of  obtaining  both  asymptotic  expressions 
and  lower  bounds  on  the  number  of  components  required  than  it  is 
to  quantify  the  complexity  of  interconnections  in  the  general  case;  and 
(2)  Till  recently  the  component  count  did  give  a  fairly  realistic  estimate 
of  overall  system  complexity.  As  Pippenger  points  out,  "...systems  that 
have  fewer  components  are  obtained  by  interconnecting  the  components  in  a 
more  intricate  way.  In  complexity  theory,  this  is  considered  an  advanta¬ 
geous  transformation."  Note  that  when  interconnection  networks  are 
entirely  contained  on  a  VLSI  chip  simple  component  count  measures  of 
complexity  are  inadequate.  In  general  this  is  because  the  topology  of 
the  network  can  have  a  pronounced  effect  on  the  area  required  to  lay  out 
the network. This has led to the development of complexity measures which
include  both  space  (area)  and  time  where  space  here  includes  space  taken 
up  both  by  the  active  components  and  the  signal  lines  connecting  the 
components  (6,7).  This  aspect  of  the  complexity  of  networks  is  not 
considered  in  this  survey.  Later,  in  section  6  many  of  the  classi¬ 
cal  results  pertaining  to  switch  count  complexity  measures  are  presented. 




4.0  Network  Classification 

There  are  several  different  characteristics  and  performance  measures 
that  can  be  used  to  group  different  networks  in  equivalence  classes.  Malek 
and  Myre  (7)  discuss  several  of  them.  Some  of  the  more  common  ones  are 
switch count, delay (in terms of number of switches in a data path) and
the set of permutations (or at least the cardinality of that set) that can
be achieved by a network.

Closely related to the permutations that can be realized by a network is
the concept of blocking. Using this characteristic various network schemes
are now classified.

Consider a (network) graph $G_N$. An input/output pair in N can be
connected if a path exists in $G_N$ from the input node to the output node.
However, when the network is in a particular state (specified by
$\{T_{se}\}$), it may not be possible to realize this path in the network if
it is not permitted by the $T_{se}$ of some switch node. Consider the example
in Figure 2. Assume $T_{se_{11}} = 0$ (i.e., the straight through
connection). This may be because we might currently have connected processor
0 to processor 2. If we now want to realize a path between processors 1 and 0
it will require $T_{se_{11}}$ to be 1, which is incompatible with the current
state of the network. This leads to what is known as blocking: the second
request for connection has been blocked by the network. In general, blocking
will occur if in $G_N$ two i - o paths have at least one edge in common.
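
The edge-sharing criterion stated above can be sketched in a few lines. The
paths and vertex names below are hypothetical (they are not taken from
Figure 2), and the helper names `path_edges` and `blocks` are introduced only
for this illustration.

```python
def path_edges(path):
    """Directed edges traversed by a path given as a list of vertices."""
    return set(zip(path, path[1:]))


def blocks(path_a, path_b):
    """True if the two paths have at least one directed edge in common."""
    return bool(path_edges(path_a) & path_edges(path_b))


# Hypothetical paths through a small multistage network.
held = ["i0", "se1", "se2", "o2"]        # connection currently held
request = ["i1", "se1", "se2", "o0"]     # shares the link se1 -> se2
alternate = ["i1", "se3", "se4", "o0"]   # uses different links throughout

print(blocks(held, request), blocks(held, alternate))
```

When a network offers more than one path for a request, a test of this kind
can be applied to each candidate path in turn.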

It is easy to see that blocking can never occur in the networks
discussed in examples 1 and 3. Such networks are called nonblocking
networks. In networks where there is more than one path from an input to an
output, it might be possible to avoid blocking by choosing one or another
of the paths. This leads to different kinds of blocking exhibited by
networks. These are discussed briefly now; more elaborate results will
be presented a little later.


Network  Classification  Based  on  Blocking 


(a) Strict sense nonblocking networks: A network is nonblocking in
the strict sense if, irrespective of what state it is in, an
'idle' input can be connected to an 'idle' output. This means,
in $G_N$, given any set of paths between a subset of inputs and
a subset of outputs, and an input $i_k$ and an output $o_l$
($i_k$ and $o_l$ are not members of the subsets), it is always
possible to find a path between $i_k$ and $o_l$ that is edge
disjoint with all the other paths. The networks in examples 1 and 3 are
strict sense nonblocking networks. Example 1 is the extreme case
where no path between an arbitrary input $i_k$ and an output
$o_l$ has any edge in common with any path between another pair.


(b) Wide sense nonblocking networks: In this type of network, states
do exist where it will not be possible to establish a path between
an unused input i and an unused output o, but such states can
be avoided if a certain 'rule' for routing the calls is followed.
Such a property necessarily implies the existence of multiple
paths between an i - o pair. By following such a specified rule,
the network would never be in a state where it is unable to
service a request for a connection. Examples for this case and
also conditions for a network to exhibit this property will be
derived in a later section.

(c) Rearrangeably nonblocking networks: In both the cases above, we
could route an input request without disturbing the paths for the
transactions currently in progress. There are networks where in
a particular state we may be able to service a request only if
we 'rearrange' the existing configurations of the network, i.e.,
by modifying $\{T_{se}\}$. Such a network is nonblocking in that
we would still be able to route any request, but it might incur
a high cost.

(d) Blocking networks: A network, in a sense, can be considered to
permute the set of inputs and the permutation realized is given by
the input/output connection. Nonblocking networks then can realize
any arbitrary permutation of the inputs. A blocking network is
one that is not nonblocking in any of the senses mentioned above.
In other words, there do exist states of the network in which it
will not be possible to satisfy a new request in any way. In terms of
$G_N$, a necessary and sufficient condition for this property is the
existence of two input/output pairs $i_j - o_k$ and $i_l - o_m$ such
that there is no path between $i_l$ and $o_m$ that is edge disjoint
from at least one path between $i_j$ and $o_k$.
5.0 Nonblocking Networks: Some Theoretical Results

One of the earliest interconnection networks ever to be studied and
used is the crosspoint switch. While its use now in large systems is
limited to telephone switching, it provides a standard against which to
compare other interconnection schemes. An n x m crosspoint switch is,
as shown in Figure 4, an array of nm contacts, called crosspoints, with
n inputs and m outputs. It has the ability to connect any input to any
output in any state the switch is in. Each crosspoint is used to connect
a unique input to a unique output. It can be considered as a single pole
single throw switch between the input and output to which it is connected.
Traditionally electromagnetic relays did the job in telephone switching.
Pass transistors, less than a millionth of their size, can now handle the job.

The crosspoint switch quite obviously is a strict sense nonblocking
network. An N x N square network that is strictly nonblocking can thus
be constructed using $N^2$ crosspoints. (Note that we have implicitly
assumed the number of crosspoints in the network to be our measure of
complexity.) Hence $N^2$ crosspoints can be taken as an upper bound on
the complexity of a strictly nonblocking N x N network.

The next logical question concerns obtaining the lower bound on the
complexity of a strictly nonblocking N x N network. A theoretical lower
bound to this value was given by Shannon in 1950 (8). The derivation
basically relates the minimum number of states the network should be able
to assume to the number of inputs/outputs. As mentioned in section 2 with
regard to $T_{se}$, an N x N network has to be able to realize N! different
permutations. A single crosspoint can be in one of two states, open or
closed (note that we are using the term 'state' here a little loosely!)
and hence the overall network, if it has $\gamma$ crosspoints, can exist in no
more than $2^{\gamma}$ distinct states. Therefore:

$$2^{\gamma} \ge N! \quad \text{or} \quad \gamma \ge \log_2 N!$$

Applying Stirling's approximation to N! we find that the lower bound is on
the order of $N \log_2 N$. Hence any network which is strictly nonblocking
with N inputs and N outputs must have at least on the order of $N \log_2 N$
crosspoints.
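
To give a feel for the gap between this lower bound and the $N^2$ upper
bound of the crosspoint switch, the following sketch evaluates both for a few
sizes. The function names are hypothetical, and exact integer arithmetic is
used in place of Stirling's approximation.

```python
from math import factorial


def shannon_lower_bound(n):
    """Smallest integer y with 2**y >= n!, i.e. ceil(log2(n!))."""
    return (factorial(n) - 1).bit_length()


def crosspoint_switch_cost(n):
    """Crosspoints in a full n x n crosspoint switch."""
    return n * n


for n in (4, 8, 16, 64):
    print(n, shannon_lower_bound(n), crosspoint_switch_cost(n))
```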

Given  this  perspective  we  now  consider  the  actual  complexities  of 
various  networks.  Here  we  have  different  ways  of  proving  the  existence  of 
a  network  with  a  certain  complexity.  The  most  important  (in  terms  of 
practical  application)  is  the  constructive  technique  where  we  actually 
produce  the  particular  network.  We  can  thus  physically  build  the  network. 

The  Clos  construction  that  we  discuss  a  little  later  comes  under  this  category. 
It  is  interesting  to  see  what  other  methods  there  can  be  of  establishing 
facts  like  this.  In  1973,  L.A.  Bassalygo  and  M.S.  Pinsker  of  the  Institute 
for  Problems  of  Information  Transmission  in  Moscow  (9)  proved  that 
a  strict  sense  nonblocking  network  with  N  inputs  and  N  outputs  exists 
that uses just $O(N \log N)$ crosspoints. The proof, a very brief outline
of  which  follows,  does  not  say  how  such  a  network  can  be  built,  however. 

A sparse crosspoint switch is obtained from a regular crosspoint switch
by removing many crosspoints, and retaining the following property: every
group of one third of the inputs can be connected to more than two thirds
of the outputs. Bassalygo and Pinsker show the existence of N x N sparse
crosspoint switches that make use of just 12N crosspoints and have the
above property. They use a combinatorial argument that involves considering
the set of all possible arrangements of the 12N crosspoints. Most of
these arrangements do satisfy the connection property above and therefore
should result in valid structures. They then go on to show that by
interconnecting sparse crosspoint switches, it is possible to construct
strictly nonblocking networks that use $O(N \log N)$ crosspoints.

A sparse crosspoint switch with 6 inputs and 6 outputs is shown in
Figure 5. Here we need less than 12N crosspoints because N < 12. For
N > 12, every input has crosspoints between it and 12 of the N outputs
(resulting in a total of 12N crosspoints).

Where does the catch in the above process lie? The problem is in
testing a configuration. Out of the N inputs, there are $\binom{N}{N/3}$
ways of choosing a third of them and an equal number of ways of choosing two
thirds of the outputs. That leaves us with $\binom{N}{N/3}^2$ different
choices. This number is astronomically large even for reasonably large N.
However, the theoretical proof of existence is quite important in itself.
Pippenger (10) has refined the bound on the number of crosspoints to
$90 N \log N$.
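
The size of this count can be made concrete with a short calculation. The
sketch below assumes N is a multiple of 3 so that "a third" is an integer;
the function name is hypothetical.

```python
from math import comb


def subset_pairs_to_check(n):
    """Ways to pick a third of the inputs times ways to pick two thirds
    of the outputs, for an n x n sparse crosspoint switch."""
    if n % 3 != 0:
        raise ValueError("this sketch assumes n is a multiple of 3")
    third_of_inputs = comb(n, n // 3)
    two_thirds_of_outputs = comb(n, 2 * n // 3)  # equals comb(n, n // 3)
    return third_of_inputs * two_thirds_of_outputs


for n in (6, 12, 60):
    print(n, subset_pairs_to_check(n))
```

Under these assumptions the count already exceeds $10^{30}$ for N = 60.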

It was Charles Clos, in 1953 (11), who first came up with the design
of a strict sense nonblocking network that uses fewer than $N^2$ crosspoints
(N inputs/N outputs). Basically his design involved interconnecting columns
(stages) of rectangular crosspoint switches and he showed that given any
$\epsilon > 0$, by using a sufficient number of stages, one can design strict
sense nonblocking networks in which the number of crosspoints is
$O(N^{1+\epsilon})$. We will now briefly consider Clos' analysis.
The basic approach in the Clos design is to apply the divide-and-conquer
paradigm  to  the  overall  network  required.  A  large  network  is  built  out  of 
a  number  of  subnetworks.  Note  that  in  the  crosspoint  switch,  every  path 
involved  the  switching  of  just  one  crosspoint.  If  we  considered  the  number 
of  crosspoints  in  the  path  to  be  a  measure  of  delay,  the  crosspoint  switch 
provides  us  with  the  fastest  switching  mechanism —  constant  delay  independent 
of  network  size.  The  saving  in  crosspoint  count  in  the  Clos  design  comes 
from  increasing  this  delay  by  a  fixed  parameter.  Inputs,  rather  than  being 
switched  by  a  single  crosspoint  are  now  switched  by  a  series  of  subnetworks. 
Note  that  these  subnetworks  can  be  crosspoint  switches,  if  they  are  small 
enough. 

A general three-stage Clos network is shown in Figure 6. The three
stages can be considered to be input, output, and intermediate switching.
Any input-output path must be successively switched by a crosspoint in the
input, intermediate, and output stages respectively, leading to a delay of
three crosspoints. (We are assuming that each subnetwork is implemented as a
crosspoint switch; this need not be the case. We can also recursively apply
the divide-and-conquer paradigm to the intermediate stages.)

The next question is how to set the parameters n, m and r so that
the number of crosspoints is minimized. The only fact we do know is that
the product of n and r equals N. The number of switching networks, m,
required in the intermediate stage must be sufficient to avoid blocking under
the worst set of conditions. What is the worst case condition? Assume that
we have (n - 1) requests currently being serviced at an input switch and
a new request comes in. The (n - 1) existing requests go off via a
crosspoint each to (n - 1) switches in the intermediate stage. Consider the
destination of the new request and the output switch which contains that
destination. The worst case here is when all the (n - 1) outputs at this
switch other than the desired destination are busy and are connected to
(n - 1) switches in the intermediate stage different from the (n - 1)
switches previously considered. At this stage one more switch is needed
in the middle column to service the new request. Hence given (2n - 1)
switches in the intermediate stage, it is possible to service a request
whatever state the network is in. Thus, for $m \ge 2n - 1$, the Clos network
is nonblocking in the strict sense.

The number of crosspoints required by the three stage Clos network
(which, incidentally, is denoted by C(3)) is given by:

$$C(3) = 2rnm + mr^2 \qquad [1]$$

Here the first term counts the r input switches and r output switches (n x m
and m x n respectively), and the second term counts the m intermediate
switches (r x r). Noting that nr = N, taking $m = 2n - 1$, and arbitrarily
choosing $n = \sqrt{N}$ one obtains:

$$C(3) = 2N(2N^{1/2} - 1) + N(2N^{1/2} - 1) = 6N^{3/2} - 3N$$

Note that this is a significant reduction in cost. For $N \ge 36$, this value
is less than $N^2$. The value of n which minimizes [1] can be found by
rewriting [1] as:

$$C(3) = (2n - 1)(2N + r^2) = (2n - 1)\left(2N + \frac{N^2}{n^2}\right) \qquad [2]$$

Differentiating [2] with respect to n and setting the result to zero gives

$$2n^3 - nN + N = 0 \qquad [3]$$

For large N (and large n), [3] can be approximated by

$$2n^2 - N = 0 \quad \text{or} \quad n = (N/2)^{1/2}$$

Substituting this value of n in [2] gives us

$$C(3) = 4N(2^{1/2}N^{1/2} - 1)$$
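
The three stage cost and the large-N approximation for the best n can be
checked numerically. The sketch below assumes the minimum strictly
nonblocking value m = 2n - 1 and restricts n to divisors of N so that
r = N/n is an integer; the function names are hypothetical.

```python
from math import sqrt


def clos3_crosspoints(n, r):
    """Crosspoints in a strictly nonblocking three stage Clos network.

    Assumes r input switches (n x m), m intermediate switches (r x r)
    and r output switches (m x n), with m = 2n - 1.
    """
    m = 2 * n - 1
    return 2 * r * n * m + m * r * r          # equation [1]


def best_n(N):
    """Divisor n of N that minimizes equation [1], by exhaustive search."""
    divisors = [n for n in range(1, N + 1) if N % n == 0]
    return min(divisors, key=lambda n: clos3_crosspoints(n, N // n))


N = 128
n_exact = best_n(N)
n_approx = sqrt(N / 2)                        # large-N approximation
print(n_exact, n_approx, clos3_crosspoints(n_exact, N // n_exact))
```

For N = 36 with n = r = 6 the function returns 1188, which agrees with
$6N^{3/2} - 3N$ and is below $N^2 = 1296$.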


The next step is to apply recursion to the intermediate stages of networks
too and synthesize them using smaller size networks. Clos shows that the
costs of 5, 7, and 9 stage networks are given by:

$$C(5) = 16N^{4/3} - 14N + 3N^{2/3}$$

$$C(7) = 36N^{5/4} - 46N + 20N^{3/4} - 3N^{1/2}$$

$$C(9) = 76N^{6/5} - 130N + 86N^{4/5} - 26N^{3/5} + 3N^{2/5}$$

(The number of stages considered is always odd because that is the way the
network grows: each time the intermediate stage is decomposed into three
stages.) Note that by going to a sufficiently large number of stages, we
can make the order of magnitude growth as near to N as we wish. But however
small $\epsilon$ is, $N^{1+\epsilon}$ still grows faster than N log N (even
though it stays less than N log N until N becomes very large).
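
The multistage cost expressions can be tabulated to see where additional
stages begin to pay off. The sketch below simply evaluates the formulas
quoted in the text (the three stage entry uses $6N^{3/2} - 3N$, i.e. the
choice $n = \sqrt{N}$); the function name and the sample values of N are
assumptions made for illustration.

```python
def clos_cost(stages, N):
    """Crosspoint counts quoted in the text for 3, 5, 7 and 9 stages."""
    if stages == 3:
        return 6 * N ** (3 / 2) - 3 * N
    if stages == 5:
        return 16 * N ** (4 / 3) - 14 * N + 3 * N ** (2 / 3)
    if stages == 7:
        return 36 * N ** (5 / 4) - 46 * N + 20 * N ** (3 / 4) - 3 * N ** (1 / 2)
    if stages == 9:
        return (76 * N ** (6 / 5) - 130 * N + 86 * N ** (4 / 5)
                - 26 * N ** (3 / 5) + 3 * N ** (2 / 5))
    raise ValueError("only 3, 5, 7 and 9 stages are tabulated")


for N in (64, 4096, 10 ** 6):
    print(N, N * N, [round(clos_cost(s, N)) for s in (3, 5, 7, 9)])
```

With these formulas the five stage count is larger than the three stage count
at N = 64 and smaller at N = 4096.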

David Cantor (12) analyzed the Clos network as it is designed to
the limits of recursion and found it takes on the order of
$N e^{2(\log N \log 2)^{1/2}}$ crosspoints which simplifies to
$O(N(\log N)^{2.269\ldots})$. Cantor himself proposed
an interconnection scheme involving subnetworks (crosspoint switches) arranged
in more than three stages that grows on the order of $N(\log N)^2$. The design
is much more complicated than the Clos network and will not be presented here.
Note that the advantage of the reduced exponent in the complexity
figures cited above does not appear until N gets very large. The Clos
network with three or five stages offers a fair compromise between the
crosspoint count and the issues in complexity that are not considered in
this figure (like the interconnection of the components). This concludes
the discussion of strict sense nonblocking networks.

Practically useful wide-sense nonblocking networks have not yet been
found. One example has been cited by Benes (13). Consider the three stage
Clos network (Figure 6) with r = 2. If the routing rule is that an empty
middle switch is not to be used unless no partially filled middle switch
can route the request, then it can be shown that no call is blocked by
the switch for $m > \lceil 3n/2 \rceil$. Thus the three stage Clos network
with r = 2 is wide-sense nonblocking for $m > \lceil 3n/2 \rceil$.

That the Clos network is rearrangeable for $m \ge n$ was proved by
A.M. Duguid in 1959 (14). The proof is based on a combinatorial argument
showing that any permutation of the N inputs can be realized by a
configuration of the network that makes use of just n switches in the
middle row. M.C. Paull proved in 1962 that in the Clos network with
m = n = r, at most (n - 1) existing input-output paths need be rerouted
in order to connect an idle pair of terminals.


6.0  Blocking  Networks 


Blocking interconnection networks by definition have the property that
there exist i - o pairs in their graphs which are not connectable when
the network is in some particular state. This can arise in two ways.

(i) It is possible that there is no path at all between $i_j$ and $o_k$
in $G_N$. In this case it is never possible to connect input j and output k,
no matter what state the network is in. We call such networks strong
sense blocking networks. One such network configuration is the 'star', whose
graph is shown in Figure 7. Though not a very practical scheme in large
systems, it is still a perfectly valid example. Note that SE = $\emptyset$ in this
case, so the nodes are labelled as p's. In this case processors i and j
are connectable if and only if there exists a link $l$ of the form
$p_i \xrightarrow{l} p_j$ or $p_j \xrightarrow{l} p_i$ in $G_N$. (In particular, a path of the form
$p_i \xrightarrow{l_1} p_j \xrightarrow{l_2} p_k$, even if present, does not connect i and k in the
network.) Thus we see that none of the 'slave' processors ($p_{s_i}$) can
communicate with each other (directly), while the 'master' ($p_m$) can
interact with all the slaves. Some other examples of this class of networks
are the mesh type interconnection (as in the Illiac IV) (2), the chordal
ring network (15) and the tree configuration (the X-tree (16), for example).

Define the underlying graph of a digraph G to be the graph obtained
from G by deleting the directions from the edges, i.e., by replacing directed
edges by undirected edges. It is then easy to see that if SE = $\emptyset$ in $G_N$,
then N will not be strong sense blocking if and only if the underlying
graph of $G_N$ is the complete graph $K_n$ (where n is the number of nodes =
number of processors). This is a reasonably strong condition to satisfy,
especially for large n: besides the number of edges itself growing as
$O(n^2)$, each node will have to grow in size as O(n), an almost complete
antithesis of the design philosophy for large scale integration. However,
the strong sense blocking characteristic of a network quite often does not
pose great problems. This is especially true where a multiprocessor system is
to be used (more often) in specific applications, i.e., the alignments required
of the network (by alignments, we mean the set of $p_i$ - $p_j$ paths that have
to exist simultaneously in the system) are known a priori and the links have
been set up with this requirement in mind.
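The completeness condition for networks without switching elements can be tested mechanically. The sketch below assumes the network graph is given as a list of directed processor-to-processor links (SE empty); the node names and helper functions are hypothetical.

```python
from itertools import combinations

def underlying_edges(links):
    """Drop edge directions: each directed link (u, v) becomes the set {u, v}."""
    return {frozenset(link) for link in links if link[0] != link[1]}

def is_strong_sense_blocking(nodes, links):
    """True if some processor pair has no direct link (SE is assumed empty)."""
    edges = underlying_edges(links)
    return any(frozenset(pair) not in edges for pair in combinations(nodes, 2))

# A star: the master 'm' is linked to every slave, slaves are not linked to each other.
star_nodes = ["m", "s1", "s2", "s3"]
star_links = [("m", s) for s in ("s1", "s2", "s3")]

# A complete graph on four processors.
full_nodes = ["a", "b", "c", "d"]
full_links = list(combinations(full_nodes, 2))

if __name__ == "__main__":
    print(is_strong_sense_blocking(star_nodes, star_links))   # star: slaves cannot talk directly
    print(is_strong_sense_blocking(full_nodes, full_links))   # complete graph: every pair linked
```

The complete graph on n processors needs n(n - 1)/2 undirected links, which is the quadratic growth referred to in the text.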

(ii) In this case, every i - o pair is connected in $G_N$, but a specific
pair may not be connectable in some states of the network. This
implies, by necessity, the existence of a nonempty SE. This more common
form of blocking is exhibited by most proposed interconnection schemes and
is what we will be concerned with in the rest of this section. We call this
type of blocking weak sense blocking. More formally, a network is weak
sense blocking if and only if there exist states of the network in which it
is not possible to connect an i - o pair and, for each such i - o pair,
there exists at least one other state in which it can be connected.

How  does  weak  sense  blocking  differ  from  the  rearrangeably  nonblocking 
networks  discussed  in  section  5?  In  the  case  of  a  rearrangeably  nonblocking 
network,  for  every  state  of  the  network  in  which  it  is  not  possible  to  connect 
an  i  -  o  pair,  there  exists  at  least  one  other  state  in  which  (a)  all 
(i  -  o)  connections  in  the  first  state  are  maintained  in  the  second  (though 
not  necessarily  using  the  same  paths)  and  (b)  the  new  i  -  o  pair  is  also 
connected. Condition (a) need not be satisfied by a weak sense
blocking network. In fact, it is not satisfied by a weak sense blocking
network, as otherwise the network would be rearrangeably nonblocking.

Note that in classifying network schemes based on their blocking
characteristic, we include a network in the highest class of nonblocking
networks possible. (We consider a weak sense blocking network to be more
'nonblocking' than a strong sense blocking network.) What we have
established then is a hierarchy of equivalence classes based on the blocking
characteristic. Any network in a particular class has the properties of all
the classes below it in the hierarchy.

There is a plethora of networks of this type that have been proposed.
All of these use multiple stages of switching elements and hence differ
basically only in the type of connection between the stages. Three of these
are shown in Figures 8, 9, and 10: the Omega network, the Indirect Binary
3-cube network, and the banyan network. Note that while the stages themselves
are 'active', the connection between two stages is, in a sense, 'passive'. Most
of these connections are variations of a basic structure known as a shuffle.
Harold Stone, in (17), discusses the applications of the 'perfect' shuffle
in parallel processing.

Define an r*m shuffle (19), denoted by $S_{r*m}$, where r and m are
some positive integers, to be the following permutation of the rm indices
<0, 1, ..., (rm - 1)>:

$S_{r*m}(i) = \left(ri + \lfloor i/m \rfloor\right) \bmod (rm)$,   $0 \le i \le rm - 1$

where $S_{r*m}(i)$ is the position of i after the shuffle. An r*m shuffle
can be viewed as a shuffle of rm cards in a deck in the following manner:
Divide the deck into r blocks of m cards each. The order of the cards in
the r*m shuffled version is given as follows: the first r cards are
the first from each of the r blocks; the second r cards are the second from
each of the r blocks, and so on.

Consider again the 3 stage Clos network shown in Figure 6. The connection
between the input and intermediate stages is the r*m shuffle. That between
the intermediate and output stages is the m*r shuffle (which, incidentally,
is the inverse of the r*m shuffle).
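The shuffle permutation and its inverse are easy to exercise on a small deck. This is an illustrative sketch that assumes the position formula (r*i + floor(i/m)) mod (r*m), which agrees with the card-dealing description; the function names are not from the report.

```python
def shuffle_position(i: int, r: int, m: int) -> int:
    """Position of index i after an r*m shuffle of r*m indices."""
    return (r * i + i // m) % (r * m)

def apply_shuffle(items, r: int, m: int):
    """Return the shuffled sequence: the item at index i moves to shuffle_position(i)."""
    out = [None] * (r * m)
    for i, item in enumerate(items):
        out[shuffle_position(i, r, m)] = item
    return out

# r = 2 blocks of m = 3 cards: blocks are [0, 1, 2] and [3, 4, 5].
deck = list(range(6))
shuffled = apply_shuffle(deck, 2, 3)
# Applying the m*r shuffle (here 3*2) afterwards undoes the r*m shuffle.
restored = apply_shuffle(shuffled, 3, 2)

if __name__ == "__main__":
    print(shuffled)
    print(restored)
```

In the shuffled deck the first two cards are the first card of each block, the next two are the second card of each block, and so on.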

The 'perfect' shuffle, as defined by Stone, is then the 2*(N/2) shuffle.
Its pattern is shown in Figure 11. The permutation $\pi$ is given by

$\pi(i) = 2i$,            $0 \le i \le N/2 - 1$
$\pi(i) = 2i + 1 - N$,    $N/2 \le i \le N - 1$        [4]

Equation [4] then tells us that if the index of an input i is represented in
binary notation (of $\log_2 N$ bits), its position after the perfect shuffle is
given by cyclically rotating the bits of i one position to the left. This
indicates that $\log_2 N$ successive shuffles of a vector return it to its
original order.
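The bit-rotation property of the perfect shuffle can be verified exhaustively for small N. The sketch below assumes N is a power of two; the helper names are illustrative.

```python
def perfect_shuffle(i: int, N: int) -> int:
    """Position of index i after a perfect shuffle of N indices (N even)."""
    if i < N // 2:
        return 2 * i
    return 2 * i + 1 - N

def rotate_left(i: int, bits: int) -> int:
    """Cyclically rotate the `bits`-bit binary representation of i one place left."""
    mask = (1 << bits) - 1
    return ((i << 1) & mask) | (i >> (bits - 1))

def shuffle_times(i: int, N: int, count: int) -> int:
    """Apply the perfect shuffle `count` times to index i."""
    for _ in range(count):
        i = perfect_shuffle(i, N)
    return i

if __name__ == "__main__":
    N, bits = 8, 3
    for i in range(N):
        print(i, perfect_shuffle(i, N), rotate_left(i, bits), shuffle_times(i, N, bits))
```

For N = 8 every index returns to its starting position after three shuffles, since three left rotations of a three-bit word restore it.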

The Omega network (Figure 8) then consists of $\log_2 N$ switching stages
with $\log_2 N$ perfect shuffles connecting the stages. The Indirect Binary
n-cube (IBNC) is a Banyan network (with $2^n$ inputs) with an inverse shuffle
appended at the end. This lets the IBNC realize the identity permutation,
which the Banyan network cannot.

The Indirect Binary n-cube network derives its name from the fact that it
can simulate the transfers along the edges of a binary n-cube, $Q_n$, which
is defined as follows: the vertices of $Q_n$ are the n-tuples of 0 and 1,
and there is an edge between two vertices iff their n-tuples differ in one
and only one component. A binary 4-cube is shown in Figure 12. Each stage
in the IBNC network is identified with one axis of the binary n-cube. The
transfers along a particular axis (if we assume a processor at each node of
$Q_n$) are obtained by setting the switches in the corresponding stage in the
'cross' position and all other switches in the 'straight' through position.
Of course, the IBNC can realize many more permutations than are permitted
by the connections of the n-cube.
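The n-cube adjacency rule maps directly onto bit operations. As a brief sketch, assuming each vertex is encoded as an n-bit integer (an encoding chosen here for illustration), neighbours and single-axis transfers are obtained by flipping one bit:

```python
def cube_neighbors(v: int, n: int):
    """Vertices of the binary n-cube adjacent to v: flip exactly one of the n bits."""
    return [v ^ (1 << axis) for axis in range(n)]

def axis_transfer(n: int, axis: int):
    """Permutation that moves every vertex along one axis: result[v] is v's partner."""
    return [v ^ (1 << axis) for v in range(1 << n)]

def differ_in_one_bit(u: int, v: int) -> bool:
    """True if u and v differ in one and only one binary component."""
    return bin(u ^ v).count("1") == 1

if __name__ == "__main__":
    print(cube_neighbors(0b0000, 4))
    print(axis_transfer(2, 0))
```

Each axis transfer is the permutation realized when one stage is set entirely to 'cross' and the remaining stages to 'straight'.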

Each of the three networks considered above is 'self-routing'; that is,
the input message along with the destination address routes itself through
the network. This is typically done as follows. The paths from one particular
input to all the outputs form a binary tree rooted at the input switch
corresponding to the input. Each node is a switching element. This is
basically the demultiplexer tree for the $\log_2 N$ bits which encode the address
of the destination. Each switch along the path strips one bit off the address
(the most significant bit), examines it and sets itself based on the bit.
Typically, a 0 sets the switch in the straight position while a 1 sets it in the
cross position. Note that the concept is identical to that of positional
trees (18).
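Self-routing can be traced for the Omega network with a few lines of code. The sketch below is an assumption-laden illustration: it uses the common destination-tag convention (a 0 bit selects the upper switch output and a 1 bit the lower one), which differs from the straight/cross wording above, and the function name is hypothetical.

```python
def omega_route(src: int, dst: int, n: int):
    """Trace a message through an n-stage Omega network with N = 2**n ports.

    Returns the line position of the message after each stage.
    """
    mask = (1 << n) - 1
    pos = src
    path = []
    for stage in range(n):
        # Perfect shuffle ahead of the stage: rotate the n-bit position left by one.
        pos = ((pos << 1) & mask) | (pos >> (n - 1))
        # The switch consumes the next address bit, most significant first.
        bit = (dst >> (n - 1 - stage)) & 1
        # Upper output (low bit 0) or lower output (low bit 1) of the switch.
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

if __name__ == "__main__":
    print(omega_route(5, 2, 3))
```

After all n stages every bit of the source position has been shifted out and replaced by a destination bit, so the final position equals the destination regardless of the source.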

We conclude this section with a brief discussion of the probability of
blocking. The stochastic model assumed for the entire system (which includes
the processor and memory modules) strongly affects the final results
that will be obtained. Patel (19) presents a model in which each processor
generates random and independent requests which are uniformly distributed over
all memory modules. The probability of acceptance in this case (assuming a mean
request generation rate per cycle of 1, i.e., a saturated system) is shown in
Figure 13a. An alternate model, which considers the blocking characteristic
introduced by the network alone (that is, leaves out the blocking inherent
in input contention), has been considered by us, in which we assume that the
inputs address unique output ports in each cycle. Simulation results for this
model are presented in Figure 13b. Analytical results for the latter case are
more difficult to obtain. The figures presented are for networks of the Omega,
IBNC or Banyan type.
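The second model (unique destinations, blocking caused by the network alone) lends itself to a simple Monte Carlo experiment. The following is only a sketch under stated assumptions: an Omega network with destination-tag routing, random permutations as traffic, and a rule that the first message to claim a line wins while later claimants are dropped. It is not the simulation used for Figure 13b, and all names are hypothetical.

```python
import random

def delivered_count(dest, n: int) -> int:
    """Number of messages that get through when input s sends to dest[s]."""
    mask = (1 << n) - 1
    alive = {s: s for s in range(1 << n)}          # source -> current line position
    for stage in range(n):
        taken = {}                                  # output line -> source holding it
        for src in sorted(alive):
            pos = alive[src]
            pos = ((pos << 1) & mask) | (pos >> (n - 1))             # perfect shuffle
            pos = (pos & ~1) | ((dest[src] >> (n - 1 - stage)) & 1)  # switch setting
            if pos not in taken:                    # later claimants are blocked
                taken[pos] = src
        alive = {src: pos for pos, src in taken.items()}
    return len(alive)

def estimate_blocking(n: int, trials: int = 200, seed: int = 0) -> float:
    """Fraction of messages blocked, averaged over random permutations."""
    rng = random.Random(seed)
    size = 1 << n
    blocked = 0
    for _ in range(trials):
        dest = list(range(size))
        rng.shuffle(dest)
        blocked += size - delivered_count(dest, n)
    return blocked / (trials * size)

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, round(estimate_blocking(n), 3))
```

The identity permutation passes without conflict, while a permutation such as 0→0, 1→2, 2→1, 3→3 on a four-port network loses two of its four messages under this drop rule.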

7.0  Summary  and  Conclusions 

Increasing interest in the design of multiprocessor systems has led to a
renewed concern with the properties of interconnection networks. These
networks, though originally studied in the context of telephone systems, are
being reexamined with multiprocessor systems in mind.

This paper has presented a review and classification of some of the
principal networks studied. The classification scheme is primarily based on
the blocking characteristics of the network. In addition, some discussion of
network complexity was presented. Further research work is needed in this
area. In particular, the effects of changing technology (e.g., VLSI) need to
be investigated in terms of how the new characteristics of these technologies
relate to network performance. In addition, it is not yet understood just how
to relate network properties to application requirements in any general
manner. These and associated questions will be examined for some time to
come.
[Figure 2 panels: IBNC Network; An Indirect Binary 2-cube Network; Graph of IB2C Network]

Figure 2: The Indirect Binary 2-cube network interconnecting four
processing elements. In this case, I = O =
P = {$p_0$, $p_1$, $p_2$, $p_3$}. Note that a path from any p-vertex to
any other p-vertex has a length of 2. In general it is n
long, where n = log|P|. Also note that such a path is unique.
Figure 3: A 4x4 Clos Network (a) and its graph (b). It is a
strict sense nonblocking network (a term that is
explained in the text).


Figure  6:  Three  stage  Clos  network. 

Each  box  is  a  subnetwork  (crosspoint) 
of  the  order  shown. 


Figure  7:  Network  graph  of  a  tree  configuration. 
Figure 4: An n x m crosspoint switch


Figure  5:  A  6x6  sparse  crosspoint  switch 
Figure 8: An 8 x 8 Omega Network


Figure 9: The indirect binary 3-cube array
Figure 12: The edge assignment in a binary 4-cube network.

Note that every edge connects two nodes whose indices
differ in exactly one bit. If we assume a processor in each
node, the alignments along the edges parallel to any one axis
are obtained by setting, in the IB4C network, the switches in
the corresponding column in the cross position and all the
rest in the straight position.
Figure 14: Message Blocking Probability ($P_B$) versus Network Size
($\log_2 N$) (for banyan networks with unique and uniformly
distributed destination addresses)
ABSTRACT

Interest in tightly coupled multiprocessor
computer systems has grown as the possibilities for
high performance with such systems have been recognized.
Central to their design is the structure of
the network over which the processors communicate.
Unless properly designed, such networks can become
both a cost and a performance bottleneck. This paper
focuses on the design of VLSI communications networks,
that is, on communications networks which can
be placed on a single VLSI chip. Traditional SSI
based cost and complexity measures for such networks
have principally involved switch aggregate
counts. In a VLSI domain, however, more appropriate
measures involve chip area and space-time
product. The effects of network topology and VLSI
layout on these measures are reviewed with regard
to two network types. Another important question
in the VLSI communication network problem
relates to chip pin constraints. This problem is
discussed and some effects and options presented by
bit slice network designs are described.

INTRODUCTION 

In recent years there has been increasing interest
in tightly coupled, physically local multiprocessor
computer systems (1,2). This has been due
both to the enhanced performance possibilities for
such systems (e.g., increased computational power
resulting from parallel processing and higher reliability
resulting from component redundancy) and
to the steady decrease in hardware costs associated
with these systems. A central issue in the design
of such systems concerns performance degradation
due to costs associated with interprocessor communication.
One aspect of this problem relates to
the question of user problem decomposition and
scheduling (3,4). Another relates to the structure
and design of the network over which the multiple
processors communicate. As the number of processors
increases, the characteristics of both the decomposition
and scheduling algorithms and the communications
network become critical in establishing
acceptable overall system performance and cost. This
paper is concerned with certain communications network
design questions which arise in the context of
multiprocessor systems designed in a VLSI environment.

Various studies aimed at characterizing and
quantifying the performance of SSI based networks
have already been pursued (5,6,7). Typically the
principal figures of merit used in these studies
have been related to the number of switches required
by the communications network and the bandwidth
of the path between processors. For a given network
architecture, determination of the number of
switches is straightforward, while estimation of
the bandwidth has, in most cases, been derived from
an analysis of the average number of switches
through which a signal must pass and the blocking
characteristics of the network.

This work was supported in part by NSF Grant
MCS-78-20731, ONR Contract N00014-80-C-0761 and
NIH Grant RR00396.

Use of these types of figures of merit makes
sense in an environment where the cost of switching
elements is substantially greater than the cost
of wires and connection paths. The situation
changes, however, with the introduction of VLSI
technology. This fabrication methodology has the
potential for economically placing large switching
networks or subnetworks on a single chip. Cost here
becomes related to chip area. Unfortunately, a new
challenge appears: the implementation of the connection
paths may use substantial amounts of the
chip area, thus limiting the area available to the
switch elements themselves. This has the effect of
reducing the size of a switching network that can
be fabricated on a chip of a given size. The time
delay associated with the connection paths also
contributes to the overall delay, thus directly affecting
bandwidth. Area, topology and layout,
mainly ignored in traditional communication network
analysis, become important interrelated factors in
VLSI network design (8).

The advent of VLSI has thus significantly
changed the design space with which engineers must
contend. Figures of merit based on
parameters of time (on/off chip time) and chip area
now seem more appropriate in many situations. In
the next section two interconnection networks,
Crossbar (9) and Banyan (10), are compared in terms
of their space-time products when implemented on a
single VLSI chip.

While single chip design questions are important,
when large networks requiring multiple chips
are to be designed with single chip subnetworks as
the building components, interchip communications
delays can dominate overall delay times. Furthermore,
there is often a close relationship between
intra- and interchip communications network design.
It may be advantageous, for instance, for the communications
network on the chip to have a topology
different from the larger interchip network. Furthermore,
this interacts directly with questions
relating to chip pin limitations, network control,
and communications path width. It may turn out
that bit serial communications paths are best because
they conserve pins and permit large networks
to be placed on a single chip. Reducing the delays
associated with interchip communications may more
than offset the extra time necessary for serial
data transmission. The question of pin limitations
is explored later in this paper and some preliminary
results are reviewed.

SPACE-TIME NETWORK COMPARISONS
Banyan  and  Crossbar  Networks 

This section reviews certain research results
comparing banyan and crossbar interconnection networks
(11). The crossbar network considered is of
the form shown in Figure 1. While there are many
ways of designing a crossbar (e.g., demultiplexer/
multiplexer designs, switched bus designs), the approach
examined has a number of generally desirable
capabilities. These include distributed (local)
network path control, asynchronous and pipelined
data transfer, and a high degree of modularity (9).
Furthermore, its naturally regular and planar layout
appears to make it well suited to VLSI implementation.
The banyan network considered is of the form
shown in Figure 2. It too can be designed for distributed
network path control, asynchronous data
transfer and pipelining. On the other hand, its
modularity properties are not quite as straightforward
as those of the crossbar, and its topology, while
regular in a certain sense, is inherently nonplanar.
Note that both networks have a full interconnection
capability in that any single input/output connection
can be made by placing the appropriate
switches in the proper positions.

While the properties mentioned above are important,
most work to date has compared the networks
principally on the basis of switch complexity (i.e.,
number of switches required for implementation)
and network bandwidth. For square N input, N output
networks, switch complexity for the crossbar is
$A_{CB} = O(N^2)$, while for the banyan it is
$A_{BA} = O(N \log N)$.

Network bandwidth is associated with three
items. First, pipeline characteristics of the network
and message length distributions must be determined.
For analysis simplicity, a nonpipelined
design and circuit switched mode of operation are
assumed. Message length considerations therefore
do not enter into these bandwidth comparisons.
The second item to consider relates to the average
number of switches through which a message must
pass, assuming uniform addressing of input and output
ports. For the crossbar this is $D_{CB} = O(N)$,
while for the banyan it is $O(\log N)$. The final item
to consider relates to the network's blocking characteristics
and the protocol to be used when a message
is blocked. The crossbar is a strict sense
nonblocking network. That is, as long as each input
port addresses a unique output port, no message
is blocked in the network due to path contention
(and no rearrangement of paths is necessary).
The banyan network, on the other hand, is a blocking
network, and under certain situations message
blocking can occur. For N less than about 2000,
the probability of this blocking can be approximated
by $P_B = 1 - b/N^a$ with a = .19 and b = 1.05.
Assuming a saturated system, synchronized messages
of equal length, and a message retry protocol where
blocked messages reenter the system with the
next message batch, the average delay through a
banyan network can be derived as
$D_{BA} = O(\log N)/(1 - P_B)$.

Overall cost measures for the banyan and
crossbar can now be obtained as:

$C_{CB} = A_{CB} \cdot D_{CB} = O(N^3)$   [1]

$C_{BA} = A_{BA} \cdot D_{BA} = O(N^c \log^2 N)$   (1 < c < 2)   [2]

From these equations it is clear that by traditional
measures, the banyan is much less costly than
the crossbar. This is also true for other blocking
networks such as the Omega (12) and indirect
binary N-cube (13), whose switch complexities are
also O(N log N).
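The traditional cost measures can be tabulated for a few network sizes. This is a rough sketch, assuming all hidden constants in the order-of-magnitude expressions are 1 and using the fitted values a = .19 and b = 1.05 quoted for the blocking probability; the function names are illustrative.

```python
import math

A_EXP, B_COEF = 0.19, 1.05   # fitted constants a and b quoted in the text

def blocking_probability(N: int) -> float:
    """Approximate banyan blocking probability, P_B = 1 - b / N**a."""
    return 1 - B_COEF / N ** A_EXP

def crossbar_cost(N: int) -> float:
    """Switch count times delay for the crossbar: N^2 * N, constants dropped."""
    return float(N ** 2 * N)

def banyan_cost(N: int) -> float:
    """Switch count times delay for the banyan, with the retry factor 1/(1 - P_B)."""
    switches = N * math.log2(N)
    delay = math.log2(N) / (1 - blocking_probability(N))
    return switches * delay

if __name__ == "__main__":
    for N in (16, 64, 256, 1024):
        print(N, round(blocking_probability(N), 3), crossbar_cost(N), round(banyan_cost(N)))
```

Because 1 - P_B = b/N^a, the banyan product grows as N^(1+a) log^2 N, which is where the exponent c = 1 + a between 1 and 2 comes from.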

VLSI  Network  Implementation  Model 

Consider next that the layout of a crossbar
network on a single chip directly follows the topology
of Figure 1. Assuming the logic associated
with each crosspoint fits into a square with sides
of length L, and the spacing between squares follows
the Mead and Conway (8) recommendations for
spacing between metal interconnect lines (i.e.,
3 feature sizes, $3\lambda$), then the area in units of $\lambda^2$
is given by:

$A_{CB} = N^2 (L + 3)^2 = O(N^2)$   [3]

Unfortunately, for the banyan case the analysis
is not as straightforward and one must refer
to reference 11 for details. The thrust of the
analysis, however, is as follows. First assume
again that the switches in the banyan network fit
into a square with sides of length L. Next assume
that two layers of metal interconnect are available,
one for horizontal and one for vertical lines.
Various layouts may be proposed, but as long
as the general form shown in Figure 2 is preserved
(i.e., successive rows of switches) two things become
evident. First, the horizontal distance required
by the network will be O(N) since this will
vary directly with the number of input and output
ports. Second, the vertical spacing between switch
rows will increase as the network grows. This is
because the number of horizontal lines which require
routing between the right and left halves of
the network increases as the network grows. These
lines, being routed on the same plane, need more
area as one moves from level (row) to level. This
is illustrated in Figure 3. The result of this is
that although there are O(log N) levels, the vertical
distance grows as O(N) and thus the area grows
as $A_{BA} = O(N^2)$.

The interesting point to note here is that
this is just the same as the crossbar, and is considerably
different from what is predicted from
switch aggregate counts. One other point to note
is that these same results can be obtained by following
a graph theoretic argument developed by
Thompson (14).

Developing time delay models follows the same
approach discussed above. As pointed out, for the
crossbar an average path contains N crosspoints.
If each crosspoint is implemented in NMOS NOR gates
with: a fanout f; a transit time $\tau$; a pullup to
pulldown transistor impedance ratio of 4; m levels
of logic; a metallization capacitance/transistor
gate capacitance ratio of $\alpha$; and intercrosspoint
drivers of minimum area; then the crossbar delay
can be derived as:

$D_{CB} = 2.5Nmf\tau + (N-1)\tau(1 + 2.25\alpha) = O(N)$   [4]

For the banyan case a more complex expression
can be obtained. Here certain assumptions must be
made concerning driving the metal lines between levels,
which increase in length from level to level.
Assuming the metal lines present purely capacitive
loads, and are driven by a matched sequence of driver
stages (8) to minimize delay, it can be shown
that the delay presented by the lines is $O(\log^2 N)$.
Introducing the multiplicative factor related to
the blocking probability yields an overall delay of
$D_{BA} = O(N^a \log^2 N)$ where 0 < a < 1. For large N this is
less than $D_{CB}$; however, it is greater than that predicted
from traditional analysis.

The overall space-time product measure is
given below:

$C_{CB} = A_{CB} \cdot D_{CB} = O(N^3)$   [5]

$C_{BA} = A_{BA} \cdot D_{BA} = O(N^d \log^2 N)$   (2 < d < 3)   [6]

These results indicate that while $C_{BA}$ is still less
than $C_{CB}$ for large N, the costs are much closer
than predicted from standard switch aggregate analysis.
A more detailed analysis indicates that for
reasonable values of N (i.e., values that could be
currently implemented on a single chip), the two
networks have roughly comparable space-time performance.
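How much the VLSI model narrows the gap can be illustrated by comparing the crossbar-to-banyan cost ratio under both measures. The sketch below assumes unit constants and borrows the exponent a = .19 from the blocking fit quoted earlier (the VLSI analysis itself only states 0 < a < 1), so the numbers are indicative only; the function names are hypothetical.

```python
import math

def traditional_ratio(N: int, a: float = 0.19) -> float:
    """C_CB / C_BA from switch counts: N^3 versus N^(1+a) log^2 N."""
    return N ** 3 / (N ** (1 + a) * math.log2(N) ** 2)

def vlsi_ratio(N: int, a: float = 0.19) -> float:
    """C_CB / C_BA from chip area and delay: N^3 versus N^(2+a) log^2 N."""
    return N ** 3 / (N ** (2 + a) * math.log2(N) ** 2)

if __name__ == "__main__":
    for N in (16, 64, 256, 1024, 2 ** 20):
        print(N, round(traditional_ratio(N), 2), round(vlsi_ratio(N), 2))
```

Under these assumptions the banyan's advantage shrinks by a factor of N when area replaces switch count, and for chip-sized N the two products are of the same order.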

PIN  LIMITATIONS 

One of the key constraints on placing very
large networks on a single chip is the limited number
of pins supported by standard integrated circuit
carriers. Consider for instance the interconnection
network depicted in Figure 4, which establishes
a B'-bit path between device $x_i$ and device
$y_j$, where $1 \le i, j \le N'$. Our interest is in developing
networks that are general in that we place
minor, if any, conditions on the specific numerical
values for N' and B'. If the network of Figure 4
were to be implemented on a single VLSI chip then
the number of required pin connections (ignoring
power, ground, and general control such as reset)
is given by 2N'B'. Suppose, for example, that we
have a square interconnection network
with N' = 12 and B' = 16. Then the number
of required pins is 384, much larger than that of common
commercially available integrated circuit carriers.
The total number of pins is limited mainly by the
increase in the physical length of the package; the
pins are typically placed on 100 mil centers if the
package is to be inserted in pads on printed circuit
boards. (We ignore here certain more advanced
schemes such as the array configuration used by IBM.)
For this pin placement, a 64 pin dual-in-line package
is 3.2 inches in length. This becomes physically
awkward to manipulate and also becomes more
susceptible to breaking forces. The 384 pin example,
for instance, would require a 19.2 inch dual-in-line
package!
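The pin arithmetic in this example is simple enough to script. The sketch below assumes a dual-in-line package with two equal rows of pins on 100 mil (0.1 inch) centers, as described above; the function names are illustrative.

```python
def required_pins(n_ports: int, path_bits: int) -> int:
    """Signal pins for an N' x N' network with B'-bit paths: 2 * N' * B'."""
    return 2 * n_ports * path_bits

def dip_length_inches(pins: int, pitch_inches: float = 0.1) -> float:
    """Length of a dual-in-line package: pins split over two rows at the given pitch."""
    return (pins / 2) * pitch_inches

if __name__ == "__main__":
    pins = required_pins(12, 16)
    print(pins, dip_length_inches(pins), dip_length_inches(64))
```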


There are a number of potential solutions to
this pin limitation problem. We review here two of
the more obvious ones, with details being presented
in reference 15. The first approach is to implement
a large network requiring many pins as a collection
of smaller networks, where each of the
smaller networks can be contained on a single chip
that meets the pin constraints.
An N'*N' network would therefore be decomposed into
a set of subnetworks (each subnetwork of size
N*N) which would themselves be interconnected in
some fashion.

The  second  approach  is  to  slice  the  network  so 
that  one  creates  a  set  of  network  planes,  each 
plane  handling  one  or  more  bits  (e.g.,  B  bits)  of 
the  B'-bit  wide  datapath.  This  is  commonly  done  in 
memory  designs.  Note  that  a  potential  problem 
arises  here  due  to  the  difficulty  in  synchronizing 
the  multiple  planes.  Although  details  of  this  is¬ 
sue  will  not  be  discussed  here,  there  are  ways  of 
dealing  with  this  problem. 

The question to be considered is what represents the "best"
combination of datapath slice B and chip network size N given: an
overall network size N'; a data path width B'; an intrachip network
type T; an interchip network type T'; a maximum allowable number of
pins Np; and a required number of pins for power, ground and
control Nc.

A  Chip  Count  Model 

While the "best" B and N selection refers to both the chip count and
bandwidth of the overall N'*N' network, due to space limitations
only the chip count analysis is reviewed here. With regard to
network types T and T', the overall chip count is a function only of
T', the interchip network type. Although there are numerous ways of
connecting subnetworks together to achieve an overall network, two
basic network types are considered.

The first is the common crossbar network. An 8*8*1 crossbar network,
for example, can be configured using 4*4*1 chip components. The
number of N*N*B chips required to implement an N'*N'*B' network is
given by:

    $N_{cb} = \lceil B'/B \rceil \, \lceil N'/N \rceil^{2}$    [7]


The  second  type  of  network  considered  is  the 
banyan.  The  number  of  N*N*B  chips  needed  to  imple¬ 
ment  an  N'*N'*B'  network  is  given  by: 

    $N_{ba} = \lceil B'/B \rceil \, \lceil N'/N \rceil \, \lceil \log_N N' \rceil$    [8]

The  first  term  in  this  expression  is  the  number  of 
bit  slices  or  network  planes;  the  second  term  is 
the  number  of  chips  at  each  level  (row)  and  the 
third  term  is  the  number  of  levels  necessary  to 
achieve  a  full  interconnection. 
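
The two chip count expressions can be sketched in a few lines of code. This is an illustrative sketch only: the function names are hypothetical, the crossbar count is taken to be the number of bit slices times a square array of modules, and the number of banyan levels is computed with an integer loop instead of a floating-point logarithm so that exact powers (for example 512 = 8^3) are not misrounded.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def ceil_log(value: int, base: int) -> int:
    """Smallest integer L such that base**L >= value."""
    if base < 2:
        raise ValueError("base must be at least 2")
    levels, reach = 0, 1
    while reach < value:
        reach *= base
        levels += 1
    return levels


def crossbar_chips(n_total: int, b_total: int, n: int, b: int) -> int:
    """Bit slices times a square array of N*N modules."""
    return ceil_div(b_total, b) * ceil_div(n_total, n) ** 2


def banyan_chips(n_total: int, b_total: int, n: int, b: int) -> int:
    """Bit slices times chips per level times number of levels."""
    return ceil_div(b_total, b) * ceil_div(n_total, n) * ceil_log(n_total, n)


if __name__ == "__main__":
    # The 8*8*1 network built from 4*4*1 chips mentioned in the text.
    print(crossbar_chips(8, 1, 4, 1), banyan_chips(8, 1, 4, 1))
    # A 16*16*1 network built from the same 4*4*1 chips.
    print(crossbar_chips(16, 1, 4, 1), banyan_chips(16, 1, 4, 1))
```

For the 8*8*1 example both organizations need four chips; at 16*16*1 the crossbar needs 16 chips against 8 for the banyan, which is where the difference in growth rates starts to show.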

Pin constraints can be introduced by noting that:

    $N_p \ge K_1 B N + N_c$    [9]

where K1 = 4 for a fully modular crossbar network, and K1 = 2 for a
banyan network. Consider next two cases. For case I the number of
power, ground and control pins is small compared to the number of
data pins and thus, using all available pins, N = Np/(K1 B). This is
typical of clocked systems where a small number of clock lines are
needed to synchronize all the data lines. For case II, Nc is not
negligible and the number of control lines is proportional to the
number of ports, N; thus, N = Np/(K1 B + Q). This would be an
appropriate model if the network chips communicated with each other
in an asynchronous manner and a request/acknowledge control line
pair (Q = 2) were associated with each port.
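
As a rough illustration of the two cases, the number of ports that fit on a chip can be computed from the pin budget. This sketch assumes that N must be an integer, so the quotient is rounded down; the function name and argument names are illustrative only.

```python
def ports_per_chip(pin_budget: int, k1: int, b: int, q: int = 0) -> int:
    """Largest integer N satisfying (K1*B + Q) * N <= Np.

    k1 is 4 for the fully modular crossbar and 2 for the banyan.
    q = 0 corresponds to case I (control pins neglected); q > 0 is
    case II, with q control lines per port.
    """
    return pin_budget // (k1 * b + q)


if __name__ == "__main__":
    # Banyan chips with a 60 pin budget.
    print(ports_per_chip(60, k1=2, b=1))         # case I, bit slice
    print(ports_per_chip(60, k1=2, b=1, q=2))    # case II, bit slice
    print(ports_per_chip(60, k1=2, b=2, q=2))    # case II, two bits per chip
    # Crossbar chips need twice the data pins per port.
    print(ports_per_chip(60, k1=4, b=1))
```

With 60 pins a one-bit banyan slice supports 30 ports in case I but only 15 once a request/acknowledge pair is attached to every port, which is the overhead examined in the chip count analysis.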

Each of the above expressions for N can now be substituted back into
equations 7 and 8, and a value of B, and thus N, yielding the
minimum number of chips obtained. To get some feeling for this one
can assume that large values of B' and N' are present and that the
number of chips can be approximated by expressions 7 and 8 with the
ceiling functions removed. From these continuous functions it is
clear that for case I the number of chips is minimized with B = 1.
This is reassuring since it corresponds to experience with memory
chip design where the slice width is generally taken as one bit. A
discrete optimization search procedure verifies that this is true
with the ceiling functions in place for crossbar networks, and for
banyan networks above a certain size (N' > 256). For case II, the
B = 1 result generally holds for the crossbar case. For the banyan
case, however, the best value of B varies considerably depending on
the particular Np, N' and B' values being considered. For instance,
with B' = 16, N' = 512 and Np = 60, B = 2 (N = 10) results in
Nba = 1248 while B = 1 (N = 15) results in Nba = 1680. Typically
large differences in chip counts occur when a nonoptimum value of B
is used in this situation. Two other points should be noted. First,
the control pin overhead in case II (Q = 2) is substantial. For
instance, with B' = 16, N' = 256 and Np = 90, Nba = 192 with Q = 0
and Nba = 348 with Q = 2. Second, from a chip count point of view,
there is a heavy penalty associated with using a crossbar interchip
network due to the O(N^2) versus O(N log N) network growth in number
of chips.
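
The banyan figures quoted above can be reproduced with a small calculation. The sketch below is an illustration under stated assumptions, not the search procedure used in the study: N is rounded down to an integer, slice widths that leave fewer than two ports per chip are skipped, and all names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def ceil_log(value: int, base: int) -> int:
    """Smallest integer L such that base**L >= value (base >= 2)."""
    levels, reach = 0, 1
    while reach < value:
        reach *= base
        levels += 1
    return levels


def banyan_chips(n_total, b_total, pin_budget, b, q):
    """Banyan chip count for slice width b, or None if no usable chip fits."""
    n = pin_budget // (2 * b + q)        # K1 = 2 for the banyan
    if n < 2:
        return None
    return ceil_div(b_total, b) * ceil_div(n_total, n) * ceil_log(n_total, n)


def min_banyan_chips(n_total, b_total, pin_budget, q):
    """Smallest chip count over all slice widths B = 1 .. B'."""
    counts = (banyan_chips(n_total, b_total, pin_budget, b, q)
              for b in range(1, b_total + 1))
    return min(c for c in counts if c is not None)


if __name__ == "__main__":
    # B' = 16, N' = 512, Np = 60, Q = 2: B = 2 versus B = 1.
    print(banyan_chips(512, 16, 60, b=2, q=2))
    print(banyan_chips(512, 16, 60, b=1, q=2))
    # B' = 16, N' = 256, Np = 90: best count without and with control pins.
    print(min_banyan_chips(256, 16, 90, q=0))
    print(min_banyan_chips(256, 16, 90, q=2))
```

Run as written, this yields 1248 and 1680 chips for the first pair, and 192 versus 348 chips for the second, in agreement with the values in the text.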

The above sort of chip minimization analysis suggests that in
designing an interconnection network chip set, chip control
procedures which are proportional to the number of I/O ports be
avoided (i.e., no request/acknowledge pairs on a per port basis),
and a banyan-like interchip network be used. Under these conditions,
a path slice of B = 1 seems appropriate. Not considered here is the
question of how this path width and interchip network selection
affect overall bandwidth. Initial results (15) indicate that the
selections above are also appropriate for bandwidth optimization.

Other questions remain to be answered. One relates to whether having
separate network chips is appropriate given their pin requirements.
Perhaps structures which include both networks and processors are
more reasonable. A variety of questions relating to centralized
versus decentralized network control, the tradeoffs associated with
circuit switched, packet and pipelined network designs, and the
various options associated with synchronous versus
asynchronous/delay insensitive design remain to be explored.
Abstract: Multiple processor interconnection networks can be
characterized as having N' inputs and N' outputs, each B' bits wide.
Construction of large networks requires partitioning of the N'
network into a collection of N*N switch modules of data size B
(B <= B'), each implemented on a single chip, and interconnecting
them with a specific interchip network type T'. The major constraint
in the VLSI environment is the pin limitation, Np, of the individual
modules; these are allocated between data and control lines, Q. This
paper presents a methodology for selecting the optimum values for N
and B given values of the parameters N', B', T', Q, and Np. Models
for both the banyan and crossbar networks are developed and
arrangements yielding minimum number of chips, average delay through
the network, and product of number of chips and delay, are
presented. A bit slice approach (B = 1) produces the optimum
arrangement for the crossbar, while for the banyan the optimum is
achieved with multiple bits per module.

Introduction 


Over the past few years a variety of physically local, closely
coupled multiple processor systems have been proposed (1,2,3,4). One
key issue in the design of such systems concerns the communications
network used by these multiprocessor systems. Various studies have
focused on the functional properties of such networks (i.e., what
permutations are possible, what control algorithms are needed,
etc.), on their complexity, and to some extent on performance issues
(5,6,7,8,9).

In  most  cases  network  complexity  has  been  measured 
by  the  number  of  elementary  switching  components 
needed  by  a  network  of  a  given  size  and  type,  while 
performance  has  been  determined  by  the  average 
number  of  elementary  switching  components  through 
which  a  message  must  pass  (i.e.  average  delay). 
Recently  work  has  begun  on  examining  complexity  and 
performance  questions  in  the  context  of  VLSI  imple¬ 
mentation  of  such  interconnection  networks. 

Franklin (10) has compared two networks, crossbar and
banyan, operating in a circuit switched mode in
terms  of  their  space  (area)  and  time  (delay) 
requirements.  The  networks  were  assumed  to  be 
implemented  as  complete  modules  on  a  single  VLSI 
chip. 

Closer examination of VLSI network implementation problems shows
that pin limitations, rather than chip area or logical component
limitations, are a major constraint in designing very large
interconnection networks. Consider, for instance, a network with N'
inputs, M' outputs and with each output being B' bits wide
(N'*M'*B'). The number of required pin connections (ignoring power,
ground and general control) for a single chip implementation is
given by B'(N'+M'). For a square network of size twelve with
B' = 16, the number of pins required would thus be 384, much larger
than the pin count of common commercially available integrated
circuit carriers. Given that pins are typically placed on 100 mil
centers along the periphery of the package, the total number of pins
is limited mainly by the increase in the physical length of the
package. For this pin placement and the 384 pin example, a 19.2 inch
dual-in-line package would be required.

In this paper we focus on two of the more obvious solutions to this
pin limitation problem. The first approach is to implement a large
network (N'*N') requiring many pins as an interconnected set of
smaller subnetworks (N*N), where each of the smaller networks can be
contained on a single chip in which the chip pin constraints are
met.

The second approach is to slice the network so that one creates a
set of network planes, each plane handling one or more bits (e.g.,
B bits) of the B'-bit wide datapath. This is commonly done in memory
designs. A potential problem arises in this approach due to the
difficulty in synchronizing the multiple planes. This is discussed
in reference 11.

The remainder of this paper deals with determining the "best"
combination of datapath slice B and chip network size N given:

1. N': An overall network size (N <= N'),

2. B': A data path width (B <= B'),

3. T : An intrachip network type (e.g., the interconnection network
       implemented within the N*N chip might be a crossbar),

4. T': An interchip network type (e.g., the interconnection network
       implemented between the N*N chips to achieve the overall
       N'*N' network might be an Omega network),

5. Np: The maximum number of pins allowed on a chip,

6. Nc: The number of pins on a chip allocated to power, ground, and
       control.

"Best"  in  this  context,  refers  to  both  chip  count 
and  bandwidth  of  the  overall  N'*N'  network. 

Figures 1, 2 and 3 illustrate a general N'*N' network, and a
possible decomposition of a sample 16*16 network. In the next
section basic models for this problem are presented and used to
determine the B and N combinations which minimize the total chip
count, the overall network delay and an overall performance measure
using the product of chip count and time delay.

The  Basic  Model 

The  basic  model  consists  of  two  parts.  The 
first  relates  to  the  chip  count  while  the  second 
concerns  network  time  delay.  For  brevity,  only 
square  fully  connected  networks  (i.e.  there  is  a 
path  from  each  input  port  to  each  output  port)  are 
considered.  Note  that  certain  input/output  paths 
may  have  a  common  subpath  and  this  may  result  in 
messages  being  temporarily  blocked. 

Let us refer to the N*N*B chip as a switch module; a number of these
modules will be interconnected to realize the N' network. This paper
considers two types of interchip networks (T'): the incremental
crossbar, CB, and the banyan, BA (12,13,14). While there are many
ways of designing a crossbar network (e.g.,
demultiplexer/multiplexer configuration, switched multiple busses,
etc.), the incremental crossbar design (Figure 4) can be expanded on
a unit basis by adding basic switch modules in a row-column
arrangement. This modularity property permits flexible expansion
while retaining the nonblocking and full connection properties of
the crossbar. A price is paid for these properties in terms of the
number of switches and pins required on a switch module. While the
number of switches required per switch module may not be a serious
constraint with VLSI technology, the problem of pin constraints is
severe. For the incremental crossbar, the modularity property
requires 4NB data pins to implement an N*N*B switch module while the
banyan, a blocking network, requires 2NB data pins.

To make global comparisons similar and to eliminate blocking at the
switch module level, this paper examines cases in which the switch
modules are constructed using an incremental crossbar architecture
(T = CB). Two types of module interconnections are examined: the
crossbar and the banyan (T' = CB or T' = BA).

Chip  Count  Model 

As illustrated in Figure 4, the number of N*N*B chips required to
implement an N'*N'*B' incremental crossbar network is given by:

    $N_{cb} = \lceil B'/B \rceil \, \lceil N'/N \rceil^{2}$    [1]

The banyan network is one of the class of blocking networks whose
logical component complexity grows as O(N log N) rather than
O(N^2). As illustrated in Figure 5, the number of N*N*B chips needed
to implement an N'*N'*B' banyan network is given by:

    $N_{ba} = \lceil B'/B \rceil \, \lceil N'/N \rceil \, \lceil \log_N N' \rceil$    [2]

The first term in this expression is the number of bit slices or
network planes that are required. The second term represents the
number of chips at each level (row), while the third term is the
number of levels.


Time  Delay  Model 

A model giving the average time for a signal to propagate through a
network must include the time to traverse each of the chips, the
time to propagate from chip to chip, and, since the bit slice
approach separates the bits in a data word, the additional time that
is needed to make certain that all the data bits have completed
their movement through the N'*N'*B' network.

The average delay associated with a basic switch module will be
designated as Dcb since these modules have a crossbar construction.
Path setup delays (i.e., time to set switches in their desired
positions) are not considered here. The delay of a pin driver and
associated interconnection wires between modules (i.e. the
intermodule delay) is denoted by Dim. The intermodule delays for the
CB and BA networks are different and will be denoted as Dimcb and
Dimba. Additional synchronization delay introduced by the designer
to assure that all data bits have traversed the network will be
represented by Dsyncb and Dsynba.

For the CB network the average delay can be determined by examining
Figure 4. Note that this represents one of $\lceil B'/B \rceil$
planes. Assume that each switch module, implemented on a single
chip, represents an N*N CB network. The pin drivers for each module
are also located on the chip. For this arrangement the number of
modules in an average path is $\lceil N'/N \rceil$ and each
intermodule path has the same delay Dimcb. Therefore the average
network delay D'cb is given by:

    $D'_{cb} = \lceil N'/N \rceil D_{cb} + \lceil N'/N \rceil D_{imcb} + D_{syncb}$    [3]

Note that a circuit switched design is assumed here with no
pipelining between modules.

For the BA network the number of switch modules and the number of
intermodule connections is $\log_N N'$. Here, because of the
connection topology, the intermodule paths are not constant in
length. The average delay, D'ba, through such a network (assuming no
delay penalty for blocking) is given by:

    $D'_{ba} = \lceil \log_N N' \rceil D_{cb} + \lceil \log_N N' \rceil D_{imba} + D_{synba}$    [4]


Pin  Constraints 


For a square N*N*B chip with Nc pins allocated to power, ground and
control, the pin constraint is given by:

    $N_p \ge K B N + N_c$    [5]

where K = 4 for the CB network and K = 2 for the BA network. The
equality will be used since it is advantageous to utilize as many
available pins as possible. Two cases may be considered. Case 1 is
the situation where the number of data pins is much larger than Nc
(i.e., KBN >> Nc) and thus equation 5 becomes:

    $N = \lfloor N_p / (K B) \rfloor$    [6]

This is typical of a clocked system where a small number of clock
lines are needed to synchronize all the data lines.


Case 2 encompasses the situation where Nc is not negligible and
there is a control line overhead associated with the data paths.
Assuming that the number of control lines is proportional to the
number of ports, N, on an individual chip (i.e. Nc = QN where Q is a
constant), N can be expressed as:

    $N = \lfloor N_p / (K B + Q) \rfloor$    [7]

This would be the appropriate model if network chips communicated
with each other in an asynchronous manner and the control line
overhead consisted of request/acknowledge pairs (Q = 2).


Chip  Count  Minimization 


For large networks with large datapath widths and chips with a large
number of pins, the ceiling and floor functions can be removed from
[1] and [2], and [6] and [7]. Then Ncb and Nba can be approximated
as continuous functions. Assume that all available pins are used and
consider Case 1 where N is given by equation 6. Substituting
equation 6 with K = 4 and K = 2 respectively into continuous
versions of equations 1 and 2 yields:

    $N_{cb1} = 16 B B' N'^{2} / N_p^{2} = K_{cb} B$    [8]

    $N_{ba1} = \frac{2 B' N' \log N'}{N_p (\log N_p - \log 2B)} = \frac{K_{ba}}{\log N_p - \log 2B}$    [9]

For a given pin constraint Np, and overall network requirements N'
and B', Kcb and Kba are constants. Minimizing Ncb1 and Nba1 for this
case requires that B be minimized. The smallest datapath width
possible is B = 1, hence with this model N should be selected to be
Np/K. This result corresponds to memory chip design where the slice
width is generally taken as one bit. Note however, that this was
obtained with a continuous approximation to equations 1 and 2; while
B = 1 yields a minimum number of chips in most cases, there are
situations where other values of B are better. For instance, with a
BA network with Np = 60, N' = 128 and B' = 16, a B = 1 solution
yields Nba1 = 160, while a B = 2 solution yields Nba1 = 144.
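
The substitution behind the Case 1 expressions can be written out step by step. The block below is a reconstruction from the definitions in this section (continuous forms of the chip count equations, with N = Np/(KB)), offered as a sketch of the algebra rather than as the authors' own derivation.

```latex
% Case 1, continuous model: ceilings and floors dropped, N = N_p / (K B).
\begin{align*}
N_{cb1} &= \frac{B'}{B}\left(\frac{N'}{N}\right)^{2}
         = \frac{B'}{B}\cdot\frac{16\,B^{2}\,N'^{2}}{N_p^{2}}
         = \frac{16\,B\,B'\,N'^{2}}{N_p^{2}},
         && N = \frac{N_p}{4B} \\[4pt]
N_{ba1} &= \frac{B'}{B}\cdot\frac{N'}{N}\cdot\frac{\log N'}{\log N}
         = \frac{B'}{B}\cdot\frac{2B\,N'}{N_p}\cdot
           \frac{\log N'}{\log N_p - \log 2B}
         = \frac{2\,B'\,N'\log N'}{N_p\,(\log N_p - \log 2B)},
         && N = \frac{N_p}{2B}
\end{align*}
```

The first expression grows linearly in B, and in the second the denominator shrinks as B grows, so both continuous counts increase with B.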

For case 2, where Nc is not negligible, equation 7 is used for N and
substituted back into the continuous versions of [1] and [2] to
give:

    $N_{cb2} = \frac{(4B+Q)^{2} B' N'^{2}}{B N_p^{2}} = \frac{K_{cb} (4B+Q)^{2}}{16 B}$    [10]

    $N_{ba2} = \frac{K_{ba} (2B+Q)}{2B (\log N_p - \log(2B+Q))}$    [11]

The derivatives of Ncb2 and Nba2 with respect to B can now be taken,
and the values of B and N which minimize the chip count obtained.

For the case of T' = CB, the number of chips Ncb2 is minimized when
B = Q/4. Thus for a request/acknowledge pair associated with each
chip datapath (Q = 2), B would be selected as 1. While this is true
for almost all cases considered, the continuous model approximation
should be checked when N' is less than 64 or B' is less than 16
(e.g., for N' = 32, Np = 75, Q = 2 and B' = 16; B = 1 yields
Ncb2 = 144; B = 2 yields Ncb2 = 128).
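
The B = Q/4 result follows from a one-line differentiation of the continuous crossbar count. The steps below are a reconstruction based on the continuous Case 2 model (N = Np/(4B+Q)), shown as a worked sketch.

```latex
\begin{align*}
N_{cb2}(B) &= \frac{B'\,N'^{2}}{N_p^{2}}\cdot\frac{(4B+Q)^{2}}{B} \\[4pt]
\frac{dN_{cb2}}{dB}
  &= \frac{B'\,N'^{2}}{N_p^{2}}\cdot
     \frac{8B\,(4B+Q) - (4B+Q)^{2}}{B^{2}}
   = \frac{B'\,N'^{2}}{N_p^{2}}\cdot\frac{(4B+Q)(4B-Q)}{B^{2}} \\[4pt]
\frac{dN_{cb2}}{dB} = 0 &\;\Longrightarrow\; B = \frac{Q}{4}
\end{align*}
```

For Q = 2 the stationary point is B = 1/2, which is below the smallest realizable slice width. Since the derivative is positive for B > Q/4, the continuous count increases over all integer B >= 1 and B = 1 is selected.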

For the case of T' = BA, an equation can be derived for obtaining
the optimum B and N; it indicates that the continuous model does not
yield optimum values in many situations. For instance, for Np = 90,
N' = 512, Q = 2 and B' = 16, a search procedure working directly
with equation 2 gives an optimum B = 4 and yields Nba2 = 684. Note
that using B = 1 in this case results in Nba2 = 1152. This is not
unusual, and in most cases (Np < 140) where Q = 2, a choice of B = 1
will be nonoptimal.
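
A discrete search of this kind is easy to sketch. The code below is an illustration, not the authors' search procedure: it enumerates every slice width B from 1 to B', derives N from the pin budget using the floor form of the Case 2 constraint, and evaluates the discrete chip count for either interchip network type. The names and the rule of skipping chips with fewer than two ports are assumptions of this sketch.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def ceil_log(value: int, base: int) -> int:
    """Smallest integer L such that base**L >= value (base >= 2)."""
    levels, reach = 0, 1
    while reach < value:
        reach *= base
        levels += 1
    return levels


def chip_count(kind: str, n_total: int, b_total: int, n: int, b: int) -> int:
    """Discrete chip count for an interchip network of the given kind."""
    slices = ceil_div(b_total, b)
    per_row = ceil_div(n_total, n)
    if kind == "crossbar":
        return slices * per_row ** 2
    return slices * per_row * ceil_log(n_total, n)   # banyan


def search(kind: str, n_total: int, b_total: int, pin_budget: int, q: int):
    """Return (B, N, chips) with the fewest chips; ties keep the smaller B."""
    k = 4 if kind == "crossbar" else 2
    best = None
    for b in range(1, b_total + 1):
        n = pin_budget // (k * b + q)
        if n < 2:                     # no usable switch module fits
            continue
        chips = chip_count(kind, n_total, b_total, n, b)
        if best is None or chips < best[2]:
            best = (b, n, chips)
    return best


if __name__ == "__main__":
    print(search("banyan", 512, 16, 90, 2))
    print(search("crossbar", 512, 16, 90, 2))
```

For Np = 90, N' = 512, Q = 2 and B' = 16 the banyan search settles on B = 4 (N = 9) with 684 chips, while the crossbar search keeps B = 1 and needs 19600 chips under the same pin budget.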

Equations 1 and 2 were solved using optimal values of N and B, and
the chip count was obtained as a function of the parameters Np, N'
and network type T'. Figure 6 illustrates how the total number of
chips varies as a function of the network size. Plots for two
different values of Np and Q are also given. For a given N', Np and
Q, the BA requires fewer chips than the CB implementation and the
curves agree with the observation that the crossbar grows as
O(N'^2) while the banyan grows as O(N' log N'). As expected,
increasing Q or N' requires a larger number of chips for both the
banyan and the crossbar. Although not shown explicitly in these
graphs, the optimum value of B is 1 for the crossbar (Np > 64),
while for the banyan the optimum B ranges from 1 to 4 (Np >= 64).

Network  Delay  Minimization 

Next we determine expressions for the delays Dcb, Dim, and Dsyn and
incorporate these into equations 3 and 4 to compute the average
delay through the two networks.

Crossbar  Network 


The value of Dcb has been developed by Franklin (10) using NMOS NOR
gates for construction of the crossbar module and is given as:

    $D_{cb} = N [2.5 m f \tau + \tau (1 + 2.25 \alpha_{cb})] = N A_{cb}$    [12]

The parameters are defined in Table 1 which also gives some typical
values. The equation assumes a circuit switched CB, and uniformly
distributed addressing of module output ports. The first term in the
brackets represents the delay through an individual switch within a
module, while the second term is the delay between switches in a
module.

The delay encountered when a signal goes off the chip, propagates
along an interconnecting path and enters another switch module is
Dimcb. A buffer (e.g., a series of inverters) must be included
within the switch module to allow the minimum size transistor to
drive the module pin and associated load with minimum delay. The
buffer delay is determined by the gate capacitance of the minimum
size transistor, the number of stages in the buffer, the capacitance
of the pin being driven, the capacitance along the interconnecting
path, and the capacitance of the receiving module pin. This delay is
minimized when exponentially sized cascaded inverters are used (14).
The delay in this case is:

    $D_{imcb} = \tau e \log_e \beta_{cb}$    [13]

where $\beta_{cb}$ is the ratio of the buffer load capacitance to
the buffer input transistor gate capacitance. The transistor gate
capacitance, Cg, is the capacitance per unit area times the gate
area of the minimum transistor. To determine the load capacitance
assume that the driving and receiving pin capacitances are equal and
each has a value of Cpin. Further postulate that the modules for the
CB will be placed on a circuit board and interconnected via printed
circuit copper paths. Given the planar topology of the CB, the
spacing between modules will generally be less than one inch. Pin
capacitance will dominate in this case and
$\beta_{cb} = (2C_{pin} + C_{path})/C_g \approx 2C_{pin}/C_g$.
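
The role of the factor e in the buffer delay can be illustrated numerically. The sketch below assumes the usual textbook model of a cascaded driver, which is not spelled out in this paper: each stage is larger than the previous one by a fixed ratio, each stage costs that ratio times the transit time, and the number of stages is treated as a continuous quantity. The function names and the sample values of tau and beta are illustrative only and are not taken from Table 1.

```python
import math


def chain_delay(tau: float, beta: float, stage_ratio: float) -> float:
    """Delay of a cascade whose stages each grow by stage_ratio.

    The cascade needs ln(beta) / ln(stage_ratio) stages to reach a load
    beta times the input gate capacitance; each stage costs
    stage_ratio * tau under the assumed model.
    """
    stages = math.log(beta) / math.log(stage_ratio)
    return stages * stage_ratio * tau


def min_chain_delay(tau: float, beta: float) -> float:
    """Closed form tau * e * ln(beta), reached at stage_ratio = e."""
    return tau * math.e * math.log(beta)


if __name__ == "__main__":
    tau, beta = 1.0, 1000.0            # illustrative values only
    for ratio in (2.0, math.e, 4.0, 8.0):
        print(round(ratio, 3), round(chain_delay(tau, beta, ratio), 3))
    print(round(min_chain_delay(tau, beta), 3))
```

Sweeping the stage ratio shows a shallow minimum at e, where the delay equals the closed form used for the intermodule delay; ratios of 2 and 4 are only a few percent worse, which is why the result is fairly insensitive to the exact sizing.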

The synchronization delay depends upon the specific design technique
used to determine that all bits have traversed the network. Assuming
a self-timed design strategy, a reasonable design practice is to
include a tolerance or guard region that is proportional to the
average delay time. The average delay can thus be expressed as:

    $D'_{cb} = K_1 \lceil N'/N \rceil (D_{cb} + D_{imcb})$    [14]

where K1 = 1 + K (K here being the guard region multiplier of
Table 1), and Dcb and Dimcb are given in [12] and [13]. Numerical
studies have shown that for the CB with Q = 0 the continuous form of
[14] usually gives the same results as the discrete form. Therefore
we shall replace $\lceil N'/N \rceil N$ by N'. Finally, using
equation 7 with K = 4 gives:

    $D'_{cb} = K_1 N' [A_{cb} + D_{imcb} (4B+Q)/N_p]$    [15]
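
The step from the guard-region form of the crossbar delay to its final form can be spelled out as follows. This is a reconstruction from the surrounding definitions (Dcb = N Acb, the ceiling replaced by its continuous form, and N = Np/(4B+Q)), given as a worked sketch.

```latex
\begin{align*}
D'_{cb} &= K_1 \left\lceil \frac{N'}{N} \right\rceil
           \left( N A_{cb} + D_{imcb} \right)
        && \text{using } D_{cb} = N A_{cb} \\[4pt]
        &\approx K_1\,\frac{N'}{N}\left( N A_{cb} + D_{imcb} \right)
         = K_1 N' \left( A_{cb} + \frac{D_{imcb}}{N} \right)
        && \text{ceiling dropped} \\[4pt]
        &= K_1 N' \left[ A_{cb} + D_{imcb}\,\frac{4B+Q}{N_p} \right]
        && N = \frac{N_p}{4B+Q}
\end{align*}
```

Only the second term depends on B and Q, and it vanishes as Np grows, which is why the delay approaches K1 N' Acb for chips with many pins.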


To minimize D'cb, 4B + Q should be minimized. With Q = 0, this means
that D'cb is minimized when B = 1. Notice that D'cb is directly
proportional to N', and decreases to a minimum value as the number
of pins Np increases. Consider next the typical parameter values
given in Table 1. For Np large (i.e. >= 64), Q = 0 and B = 1, the
average delay can be approximated as:

    $D'_{cb} \approx 6.2 N'$ nsec    [16]


Banyan  Network 


The average delay through the BA network is given by equation 4. The
value of Dcb is known from [12] and we assign a value to Dsynba that
is proportional to the average path delay through the network. The
only remaining quantity to determine is the value for Dimba. The
development follows that presented for the CB. In this case however
the separation between switch modules in the BA is not constant, and
Cpath will vary according to the banyan level. Since the number of
levels required for a specific configuration is not known a priori,
the inclusion of a variable for Cpath complicates the delay
computation. The last level has the longest path (3 inches) and
therefore the maximum capacitance. The capacitance of a typical
printed circuit path is approximately 1 pf/inch, thus the delay in
driving this longest path is:

    $D_{imba} = \tau e \log_e((2C_{pin} + S)/C_g)$    [17]

By decreasing the pin driver area as the banyan level decreases this
value applies to all levels.

The average delay through the banyan network can now be expressed
as:

    $D'_{ba} = K_1 \lceil \log_N N' \rceil [N A_{cb} + \tau e \log_e((2C_{pin} + S)/C_g)]$    [18]

The continuous version of this equation is a poor approximation to
the discrete version, thus only the discrete will be used. Using the
values from Table 1 the banyan delay can be expressed as:

    $D'_{ba} = 6.17 \lceil \log_N N' \rceil (N + 1.78)$    [19]

The discrete relations for the CB and the BA delays were solved
using optimal values of N and B, and the delays obtained as a
function of parameters Np, N' and network type T'. The banyan delay
is consistently smaller than the crossbar for networks of reasonable
size.
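
The two numerical delay expressions can be compared directly. The sketch below simply evaluates the approximate crossbar delay (6.2 N' nsec, stated for large Np, Q = 0 and B = 1) and the discrete banyan delay (6.17 times the number of levels times (N + 1.78)); the function names are illustrative, and the number of levels is computed with an integer loop to avoid floating-point logarithm rounding.

```python
def ceil_log(value: int, base: int) -> int:
    """Smallest integer L such that base**L >= value (base >= 2)."""
    levels, reach = 0, 1
    while reach < value:
        reach *= base
        levels += 1
    return levels


def crossbar_delay_ns(n_total: int) -> float:
    """Approximate crossbar delay for large Np, Q = 0 and B = 1."""
    return 6.2 * n_total


def banyan_delay_ns(n_total: int, n: int) -> float:
    """Discrete banyan delay for N*N switch modules."""
    return 6.17 * ceil_log(n_total, n) * (n + 1.78)


if __name__ == "__main__":
    n_total = 512
    print(round(crossbar_delay_ns(n_total), 1))
    for n in (30, 10, 5):
        print(n, ceil_log(n_total, n), round(banyan_delay_ns(n_total, n), 1))
```

For N' = 512 the crossbar estimate is a little over 3 microseconds, while the banyan delay is a few hundred nanoseconds and falls as the module size N shrinks from 30 to 5, even though the number of levels rises from 2 to 4.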

Chip  Count-Time  Product  Minimization 

The chip count-time product, P, can be obtained by multiplying the
appropriate equations given previously. Earlier discussion indicated
that for reasonable size networks, both chip count and delay were
minimized in the CB case with B = 1. Consequently the product is
also minimized with this choice (for N' > 64, B' >= 16).

For the BA, the situation is more complex and a computer search for
the optimum B and N values must be undertaken. Consider the case of
Q = 0 and N' = 512. Table 2 shows the values of N and B which
optimize the number of chips, the delay, and the chip count-time
product. The B and N values required for minimizing P fall between
those needed for minimization of the chip count and delay measures
by themselves. The count minimization is achieved by attempting to
place as large a network as possible on a given chip. Delay
minimization is achieved by balancing the delays associated with the
module network and the delays associated with increasing the number
of levels in the overall network. In this case placing as large a
network on a chip as possible is not the best strategy from a delay
point of view. Note that this analysis does not consider delays
associated with network blocking, which can have a significant
effect in a saturated network.
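
The computer search described above can be sketched as follows for the banyan with Q = 0. This is an illustrative reconstruction, not the authors' program: it enumerates B from 1 to B', takes N as the floor of Np/(2B), and uses the discrete chip count together with the numerical banyan delay expression 6.17 x levels x (N + 1.78). Ties are resolved in favor of the smaller B, and all names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def ceil_log(value: int, base: int) -> int:
    """Smallest integer L such that base**L >= value (base >= 2)."""
    levels, reach = 0, 1
    while reach < value:
        reach *= base
        levels += 1
    return levels


def banyan_points(n_total: int, b_total: int, pin_budget: int):
    """One candidate design per slice width B (case 1: Q = 0, K = 2)."""
    points = []
    for b in range(1, b_total + 1):
        n = pin_budget // (2 * b)
        if n < 2:                      # no usable switch module fits
            continue
        levels = ceil_log(n_total, n)
        chips = ceil_div(b_total, b) * ceil_div(n_total, n) * levels
        delay = 6.17 * levels * (n + 1.78)
        points.append({"N": n, "B": b, "chips": chips,
                       "delay": delay, "cdp": chips * delay})
    return points


def optimize(n_total: int, b_total: int, pin_budget: int):
    """Best design under each of the three measures (first minimum wins)."""
    points = banyan_points(n_total, b_total, pin_budget)
    return {
        "count": min(points, key=lambda p: p["chips"]),
        "delay": min(points, key=lambda p: p["delay"]),
        "product": min(points, key=lambda p: p["cdp"]),
    }


if __name__ == "__main__":
    for pins in (60, 90, 120):
        best = optimize(512, 16, pins)
        for measure, p in best.items():
            print(pins, measure, p["N"], p["B"], p["chips"], round(p["delay"]))
```

For N' = 512 and B' = 16 this sketch selects the same (N, B) pairs as Table 2 for pin budgets of 60, 90 and 120, for example N = 10, B = 3 with 936 chips as the product-minimizing design at 60 pins. The computed delays can differ from the tabulated ones by about a nanosecond because the coefficients in the delay expression are rounded.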

Values for N and B which minimized P were obtained for both network
types over a range of N', Np and Q values (Figure 7). As expected, P
increases with increasing N' and increasing Q, and decreases with
increasing Np. Once again the banyan does better than the crossbar
on this overall performance measure.

Summary  and  Conclusions 

This paper concerned the design of multiple processor
interconnection networks. Models for both the banyan and crossbar
networks (T') were developed, and arrangements yielding minimum
number of chips, average delay through the network, and product of
number of chips and delay were presented. The results show that for
the crossbar a bit slice approach (B = 1) produces the optimum
arrangement, while for the banyan the optimum is achieved with
multiple bits per module. The impact of the number of control lines
on chip count, delay and product was also modelled.

The analysis presented made a number of assumptions whose effects
are being further investigated. In particular the role of blocking
in the banyan case, the potential gain which would accrue from a
pipelined design, and the problem of synchronization between network
planes are being studied.


Parameter                                Symbol            Units     Typical Value

minimum feature size                     $\lambda_{min}$   $\mu$m    3
minimum gate area                        $A_{min}$         ($\mu$m)^2   36
gate capacitance                         $C_g$             pf        1.4*10^-2
switch module pin cap.                   $C_{pin}$         pf        5
transit time                             $\tau$            nsec      0.3
NOR gate logic levels per crossgate      m                 --        2
NOR gate fanout                          f                 --        2
Metal path cap. to transistor gate
  cap. ratio (switch module)             $\alpha_{cb}$     --        0.1
guard region multiplier                  K                 --        0.1
printed circuit path cap.                $C_{path}$        pf        1 pf/inch
length of longest BA path                S                 inches    12
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P  COUNT  MINIMIZATION 

30 

1 

576 

392 

226 

DELAY  MINIMIZATION 

5 

6 
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168 
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PRODUCT  MINIMIZATION 

10 

3 

936 

218 

204 

N  -  90 

F  COUNT  MINIMIZATION 

45 

1 

384 

578 

222 

DELAY  MINIMIZATION 

5 

8 

824 

168 

138 

PRODUCT  MINIMIZATION 

11 

4 

564 

237 

133 

N  -  120 

p  COUNT  MINIMIZATION 

60 

1 

288 

763 

220 

DELAY  MINIMIZATION 

5 

11 

824 

168 

138 

PRODUCT  MINIMIZATION 

10 

6 

468 

218 

102 

TABLE  2:  BANYAN  NETWORK  MINIMIZATION  RESULTS 
(N*  -  512,  Q  -  0,  B'  -  16) 

(CDP: Count Delay Product)

WORD INCONSISTENCY IN PARTITIONED VLSI INTERCONNECTION NETWORKS


M.A.  Franklin,  D.F.  Wann,  and  W.J.  Thomas 
Washington  University 
St.  Louis,  MO  63130 

1.0  INTRODUCTION 

Large, closely coupled multiprocessor systems typically require the
presence of interconnection networks to provide for high bandwidth
communications paths between processors. As the size of the systems under
consideration has grown, the importance of the design of the interconnection
network has become more apparent, and a number of studies have been undertaken
to characterize such networks both from functional and VLSI implementation
viewpoints (1,2,3,4,5). Due principally to pin constraints, such networks when
implemented via VLSI methods require partitioning into multiple chips. In
general a square N'*N'*B' (N' inputs, N' outputs, B' bit wide data path)
network (Figure 1) can be partitioned into multiple chips each of size N*N*B
(N \le N', B \le B') which can be interconnected to form the desired primed
network. Figure 2 illustrates the Z = \lceil B'/B \rceil planes needed to
implement an N'*N'*B' network. Given a performance criterion the optimum B and
N can be shown to be a function of the network type (e.g., omega, banyan,
crossbar), the form of the control structure, and the pin constraints. For
example, if it is desired to minimize the number of chips and an incremental
crossbar architecture of the type shown in Figure 1 is used both between and
within the network chips, reference (6) demonstrates that for many cases the
minimum is achieved by selecting B = 1. In this case the architecture becomes
the common bit slice.
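
To make the partitioning arithmetic concrete, the sketch below computes the number of planes Z and a chip count for a crossbar. The chip-count formula assumes that each plane is tiled from N*N crossbar chips in a square array; that tiling rule and the function names are illustrative assumptions, not the model of reference (6).

```python
import math


def planes(b_total, b_chip):
    """Z = ceil(B'/B): number of bit planes needed for a B'-bit data path."""
    return math.ceil(b_total / b_chip)


def crossbar_chips(n_total, n_chip, b_total, b_chip):
    """Chips for an N'xN'xB' crossbar tiled from NxNxB chips (assumed tiling)."""
    per_plane = math.ceil(n_total / n_chip) ** 2
    return per_plane * planes(b_total, b_chip)


# Bit slice (B = 1) with a 16-bit data path needs 16 planes.
print(planes(16, 1))
print(crossbar_chips(512, 32, 16, 1))
```

The ceiling matters when B does not divide B': a 16-bit path built from 6-bit chips still needs three planes.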

This paper focuses on an important problem which arises when a network is
partitioned across the bits in a data word. Consider a particular source, S_i,
whose B' bits are partitioned into Z planes. Thus S_i can be represented as

    S_i = \{ S_i^1, S_i^2, \ldots, S_i^p, \ldots, S_i^Z \},   1 \le i \le N'      [1]

where the superscript identifies the plane. A specific plane, p, then can
interconnect only the bits from the pth partition of each source, (S_i^p), to
the pth partition of each destination, (D_k^p). This is illustrated in Figure 3.

Since the interconnection network supports concurrent transactions, two
sources, say S_i and S_j, may make simultaneous requests to be connected to,
say, D_k. The arbitration of these requests, and the appropriate path selection
through the network, can be handled either on a central basis (i.e., individual
crosspoints controlled from one central location) or on a distributed basis
(i.e., each crosspoint making its own decision). Because of reliability and
extendability considerations - of particular importance in VLSI - a modular
decentralized control structure has significant realization and perhaps
performance advantages. Unfortunately, if a distributed control arbitration
arrangement is used a non-homogeneous word can be received. To understand this,
suppose that S_i and S_j are both requesting a path to D_k at about the same
time. It is possible that a path from S_i to D_k will be established on chip
p (S_i^p to D_k^p) while on chip q a path from S_j to D_k will be established
(S_j^q to D_k^q). This is illustrated for the case where B' = 16, B = 1 in
Figure 4, in which S_j captures the path of plane 14. The word received at D_k
is then

    D_k = \{ S_i^1, S_i^2, \ldots, S_i^{13}, S_j^{14}, S_i^{15}, S_i^{16} \}      [2]

and an inconsistency occurs at plane 14. The path from S_i^{14} to D_k^{14} is
blocked because of the connection to S_j^{14}.
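
The B' = 16, B = 1 case can be sketched in a few lines. The code below assembles the word seen at a destination from whichever source holds each plane; the source labels "i" and "j", the bit patterns, and the function names are illustrative assumptions.

```python
def received_word(winners, words):
    """Assemble the word seen at a destination.

    winners[p] names the source holding plane p (0-based; one bit per plane,
    i.e. B = 1); words[s] is the list of bits that source s is sending.
    """
    return [words[winners[p]][p] for p in range(len(winners))]


def is_consistent(winners):
    """A word is consistent only if every plane is held by the same source."""
    return len(set(winners)) == 1


Z = 16
words = {"i": [1] * Z, "j": [0] * Z}
winners = ["i"] * Z
winners[13] = "j"  # plane 14 in the 1-based numbering of the text
word = received_word(winners, words)
print(word, is_consistent(winners))
```

With these patterns the received word carries fifteen bits from one source and one bit from the other, which is exactly the mixed word described above.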

Notice that there is a switching module at which requests from S_i and S_j
intersect, and at which an arbitration may occur. Neglecting the arbitration
uncertainty, which is a small effect (7,8), the path to the destination is
awarded to the request that arrives first. But a source request is divided into
Z requests, one for each plane. Therefore, even if one of the sources makes a
request prior to another source, the Z plane level requests may not arrive at
their respective arbitration modules first on all planes. This can happen
because the propagation delay along a path is not a constant from plane to plane.
It is thus possible that the propagation delay from S_i to the arbitration
module on plane p will be greater than the propagation delay from S_j, while on
plane q the propagation delay from S_i will be less than from S_j. This problem
is discussed in more detail later. Thus a received B' bit word can contain a
nonhomogeneous set of bits if the interconnection network utilizes a distributed
methodology for establishment of the paths through each of the partitions,
rather than a single path controller assignment strategy. If this local
distributed control structure is selected (because of its other desirable
features) then some mechanism must be included to detect that each of the paths
established to D_k through the partitions is from the same source. We define a
word received at the destination as being inconsistent if all its B' bits are
not from the same source.
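
A minimal sketch of this race, assuming ideal first-come arbitration on every plane: each request reaches the arbitration module on plane p at its issue time plus that plane's propagation delay, and the earlier arrival wins. The tie rule and all numeric values are arbitrary choices made for illustration.

```python
def plane_winners(t_i, t_j, delays_i, delays_j):
    """Winner per plane under first-come arbitration.

    t_i, t_j: request issue times of the two sources.
    delays_i, delays_j: per-plane propagation delays to the arbitration module.
    Ties go to source i, an arbitrary choice for this sketch.
    """
    winners = []
    for d_i, d_j in zip(delays_i, delays_j):
        winners.append("i" if t_i + d_i <= t_j + d_j else "j")
    return winners


def inconsistent(winners):
    """True when different sources hold different planes."""
    return len(set(winners)) > 1


# Source i issues first, but its path on the third plane is slower.
w = plane_winners(0.0, 1.0, [10, 10, 12], [10, 10, 10])
print(w, inconsistent(w))
```

Even though source i requests one time unit earlier, the two extra units of delay on one plane are enough to hand that plane to source j.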

In this paper we assume that pin limitations and minimization of the total
number of chips dictate partitioning across the B' bits and that reliability
and modularity considerations require decentralized control. We then
investigate two problems:

1)  What types of decentralized control allow modularity and also permit
the detection of inconsistent words?

2)  Given the statistical properties of the requests from the N' sources,
what is the probability that an inconsistent word will be received by a
destination?

A solution to the first problem permits us to implement locally controlled
networks that can detect an inconsistent connection; if an inconsistency is
detected a retry can be initiated. A solution to the second problem allows us
to estimate how often an inconsistent interconnection is likely to occur, and
thus permits us to judge if the benefits of decentralized control offset its
limitations.

2.0  BIT  PARTITIONING  CONTROL  STRATEGIES 

One economical method for realizing distributed control is to carry it out
via the same pathways that the data occupies. The message transmitted by a
source therefore would contain, in sequence, the following: a) a header to
request the path and identify the destination, b) data bits comprising the
information to be sent to the destination, and c) an end of message word to
indicate the end of transmission. In order to place the network in a state to
receive the next path request by the next source, the end of message would also
initiate a path clearing operation.

Decentralized  control  implies  that  there  is  no  global  sequencing,  thus 
communication  between  chips  is  accomplished  via  some  form  of  self-timed  protocol. 
Within  a  chip,  however,  the  transactions  may  be  carried  out  with  a  local  clock 
or  with  a  self-timed  methodology.  Although,  in  general,  a  self-timed  protocol 
requires  more  wires  (and  hence  pins)  than  a  clocked  scheme,  if  properly  designed 
it  has  the  advantages  of  being  able  to  be  easily  expanded  (e.g.,  step  and 
repeat)  without  a  recomputation  of  timing  constraints,  and  usually  also  has  the 
highest  bandwidth. 

Here we describe a switching module and communications signalling
protocol that contains provisions for:

    path establishment
    transfer of data from source to destination
    detection of a blocked path
    indication of end of transmission with path clearing
    initiation of retry for path establishment
    detection of word inconsistency


2.1  Switch  Module  Characteristics 

Figure 5 illustrates the primitive switch module and associated signals that
are used to form the self-timed crossbar interconnection network. These
modules are utilized as depicted in Figure 1. Each module has 4 connections
per side for a total of 16 connections per module. The sides are identified via
subscripts, L = Left, R = Right, T = Top, and B = Bottom. The left and top side
connections correspond to communication links with a data source, while the
right and bottom side connections correspond to communication links to a data
destination. The left side of a switch module contains a path R_L^1 that is used
if a binary one is to be sent to the module, and a path R_L^0 that is used if a
binary zero is to be sent to the module. Likewise for the top side. To achieve
the self-timed operation an acknowledgment of the receipt of a data bit must be
returned to the data source and this is performed by the path A_L for the left
side and the path A_T for the top side. When the data source receives a signal
from the A line it indicates to it that the last transmitted data bit has been
successfully received by the destination. All signalling events are encoded as
changes, thus a logic change in a line represents the occurrence of a signal on
that line. For example, if the line R_L^1 changes from a 0 to a 1 or vice versa
it represents the transmission of a one to the switching module. Since no
return to zero is required to achieve a quiescent condition when using such
transition encoding, the number of transitions per bit can be minimized thus
maximizing the data rate. In addition to the R^1, R^0 and A signals, on each side
of the module there is a negative acknowledge signal, NA, whose purpose will be
described later.
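
The transition encoding can be illustrated with a short sketch. Each bit is sent as a single level change on one of two request lines, here labelled "R0" and "R1"; the line labels, the initial levels, and the omission of the interleaved acknowledge events are simplifying assumptions.

```python
def send_bits(bits, levels=None):
    """Return the (line, new_level) events that transmit a bit string.

    A one toggles line 'R1', a zero toggles line 'R0'. No return to zero is
    needed, so each bit costs exactly one transition.
    """
    levels = dict(levels or {"R0": 0, "R1": 0})
    events = []
    for b in bits:
        line = "R1" if b else "R0"
        levels[line] ^= 1  # a change in either direction is the signal
        events.append((line, levels[line]))
    return events, levels


def receive_events(events):
    """Recover the bits: only which line changed matters, not its new level."""
    return [1 if line == "R1" else 0 for line, _ in events]


events, final_levels = send_bits([1, 1, 0, 1])
print(events)
print(receive_events(events))
```

Note that the second one is signalled by R1 falling back to 0; the receiver decodes it the same way as the rising edge.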

2.1.1  Switch  States 

The switch module can be considered as having five data connection states.
These are illustrated in Figure 6 and correspond to: the horizontal and
vertical paths inactive (State I); the horizontal path active (State H); the
vertical path active (State V); the horizontal and vertical paths both active
(State HV); and the left side to bottom side corner path active (State C).
Notice that a data connection and switch state corresponding to data flow from
the top side to the right side are not needed in the crossbar interconnection
network.

2.2  Path  Establishment 

The goal is to establish a circuit switched path from source S_i to
destination D_k. We now show how S_i requests that such a path be
established.

Assume that all modules are in their inactive state and let the row,
column identification of sources and destinations correspond to that shown in
Figure 3. In order to establish a path from S_i to D_k it is necessary that:
all switch modules in row i from column 1 to column k-1 be set to their
horizontal active state, the module at the intersection of row i and column k
be set to its corner state, and all modules in column k between the corner
module and the destination be set to their vertical state. This will complete
the path. To achieve these states, the source sends a header word (which is
merely a special bit string) of length k into the network from its connection
at port i. Thus, if the destination is D_6, the bit string will be of length 6.
This bit string is formatted such that its first k-1 bits are zeros and its kth
bit is a one. For example, to request a path to D_6 the sequence 000001 would
be sent by source S_i by making proper changes on lines R_L^0 and R_L^1.
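
The header format and its effect along a row can be sketched as follows. The function names are hypothetical; the row trace models only the behaviour of initially inactive modules that receive header bits from the left.

```python
def header_bits(k):
    """Header for destination column k: k-1 zeros followed by a single one."""
    if k < 1:
        raise ValueError("column index must be at least 1")
    return [0] * (k - 1) + [1]


def trace_row(bits):
    """States taken by successive inactive modules along the source's row.

    Each module examines the first bit it sees: a zero is absorbed and sets
    the module to H; a one sets the module to the corner state C.
    """
    states = []
    remaining = list(bits)
    while remaining:
        first = remaining.pop(0)
        if first == 0:
            states.append("H")
        else:
            states.append("C")
            break
    return states


print(header_bits(6))
print(trace_row(header_bits(6)))
```

Because every horizontal module absorbs exactly one zero, the one always reaches the module in column k as that module's first bit.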

The module response to header bits arriving from the left is as follows:

The module, which has an inactive horizontal path (it is in the I state)
examines the first bit that it receives. If this bit is a zero the module
changes to the horizontal state, absorbs (i.e., does not pass on) this first
bit, and then generates an acknowledgment on A_L to indicate receipt of the
first bit. Subsequent header bits arriving at this module find it in the
horizontal active state and these bits (either zero or one) are merely passed
on to R_R^0 or R_R^1. If the first header bit that an inactive module receives
is a one, then the module enters the corner state and passes this bit out its
bottom side via R_B^1.

In the example under consideration, the first five modules in row i would enter
the horizontal state and the sixth module would enter the corner state. Thus
the proper column has been found and the data connections to this point have
been established. The module immediately below the corner module will now
receive a one via a change on its R_T^1 line. A module whose vertical path is
inactive and that receives a one from the top enters the vertically active
state and passes this one on to the module beneath it via R_B^1. (Note that a
module will never receive a zero as its first bit from the top.) This process is
repeated along column k. When destination k receives its first one it knows
that this is the last bit in the header word. The destination then produces a
signal on line A_B of the first module in column k and this signal passes
backward along the path through column k via A_B to A_T of each module, then at
module i, k from A_B to A_L, along row i from A_R to A_L, eventually reaching
the source S_i. Upon receipt of this acknowledgment to the kth bit of its header
word, the source knows that the path to the destination has been completed and
that the destination is ready to accept data. (If the destination were not
ready, it would have delayed issuing the acknowledge.) The source may now
transmit data.


2.3  Transmission  of  Data 

The source transmits data bits by sending events on R_L^0 or R_L^1 to the
module in row i column 1. This module passes the bits on via the established
pathway to the destination. Each data bit is acknowledged by the destination
back to the source via the acknowledge lines.

2.4  Blocked  Path 

The above description assumes that there are no other paths in use at the
time that the S_i to D_k connection is requested. If another source has
previously established a connection to D_k then a provision must be introduced
into the interconnection network to indicate to the source S_i that a blocked,
or unavailable, path has been encountered. We adopt this procedure rather than
merely "waiting" at the module because, as will be discussed later, the waiting
method can produce a deadlock condition. Thus, it is necessary to transmit a
signal back from the blockage point to S_i. This means that in general S_i will
either receive an acknowledge signal (path completed) or a negative acknowledge
(path blocked); hence, this requires one bit of information. To transmit these
two conditions in a delay insensitive manner requires the addition of a fourth
line to the module configuration. This is the NA signal line illustrated in
Figure 5. This signal is used during the path establishment phase of the
protocol and, as will be described in a later section, is also used during the
end of data transmission phase.

Observe that when we permit concurrent paths to be established in the crossbar
network the definition of inactivity must be interpreted to include just the
inactivity of the path that is required. For example, suppose a header bit
enters a module in which the vertical path is active and the horizontal path is
inactive. If the header bit is a zero then the new state of the module becomes
horizontal active and vertical active (State HV) and an A is returned to the
source. If the header bit is a one (requesting the corner) the module state
remains the same (State V) and a NA is returned to the source to indicate that
the path is not available. In both cases the A or the NA move back along the
partially established path to the source. However, as the NA signal moves
through a module it must clear each module by placing it in the inactive
horizontal, inactive vertical, or inactive horizontal and inactive vertical
conditions, thereby making the path available for the next request.
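
The module behaviour described so far can be collected into small lookup tables. This is a sketch of one reading of the text, not the state diagram of Figure 7: the entries marked "inferred" are assumptions made by symmetry, and the response labels "A", "NA" and "PASS" are names chosen for this example.

```python
# States as in Figure 6: I (inactive), H, V, HV, C.
# Key: (state, first bit seen on an inactive path) -> (new state, response).
LEFT_HEADER = {
    ("I", 0): ("H", "A"),      # zero absorbed and acknowledged on the left
    ("I", 1): ("C", "PASS"),   # one turns the corner, passed out the bottom
    ("V", 0): ("HV", "A"),     # horizontal path is free although V is active
    ("V", 1): ("V", "NA"),     # corner needs the bottom port: blocked
}
TOP_HEADER = {
    ("I", 1): ("V", "PASS"),   # one from the top is passed on downward
    ("H", 1): ("HV", "PASS"),  # inferred: vertical path is still free
    ("C", 1): ("C", "NA"),     # inferred: bottom port held by the corner path
}
# Effect of an NA travelling back through a module on the path it clears.
CLEAR = {
    ("H", "horizontal"): "I",
    ("HV", "horizontal"): "V",
    ("V", "vertical"): "I",
    ("HV", "vertical"): "H",
    ("C", "corner"): "I",
}


def first_bit(state, side, bit):
    """New state and response when a first header bit arrives on `side`."""
    table = LEFT_HEADER if side == "left" else TOP_HEADER
    return table[(state, bit)]


print(first_bit("V", "left", 0))
print(first_bit("V", "left", 1))
print(CLEAR[("HV", "horizontal")])
```

Keeping the clearing rule in the same table-driven form reflects the point made in the text that one NA mechanism serves every case in which a path is released.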

2.5  End  of  Transmission 

The protocol described so far allows for the establishment of a path, the
transmission of data over a path, the detection of a blocked path, and the
automatic resetting of individual module states along a blocked path, all in a
self-timed environment. Now we explain how a path that has been successfully
established between source S_i and destination D_k can be relinquished and
placed in the inactive state at the conclusion of the data transfer.

The data transmission is arranged so that a special bit stream combination
(e.g., a special character) is reserved to indicate to the destination an End of
Transmission. For example, if ASCII characters were being sent over the
interconnection path then the stream 00000100 (hexadecimal 04) might be used as
an End of Transmission indicator. The destination continuously decodes the
received data bit stream and provides an acknowledge signal after every bit.
When the destination detects an EOT character, instead of supplying an A signal,
the destination produces a NA signal. This signal then travels back along the
path, placing each switch module in its appropriate inactive state (H, V or I)
and arrives at the source indicating that the path is now clear. Note that this
action of the NA on the individual modules is identical to its action when a
blocked signal path is encountered; thus, no additional logic is required by the
module to implement this function.
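
A minimal sketch of the destination side of this convention, assuming 8-bit characters sent most significant bit first and aligned on character boundaries (both are assumptions of this example). The ASCII End of Transmission code is hexadecimal 04.

```python
EOT = 0x04  # ASCII End of Transmission


def to_bits(byte, width=8):
    """Bits of a character, most significant bit first (assumed bit order)."""
    return [(byte >> (width - 1 - n)) & 1 for n in range(width)]


def destination_responses(bits, char_len=8):
    """One response per received bit: 'A', or 'NA' on the last bit of an EOT."""
    responses = []
    value = 0
    for count, bit in enumerate(bits, start=1):
        value = (value << 1) | bit
        if count % char_len == 0:
            if value == EOT:
                responses.append("NA")  # triggers path clearing
                break
            value = 0  # start decoding the next character
        responses.append("A")
    return responses


stream = to_bits(0x41) + to_bits(EOT)  # the letter 'A' followed by EOT
print(destination_responses(stream))
```

The NA replaces the acknowledge for the final bit of the EOT character, so the source sees exactly one response per bit it sent.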

2.6  State  Diagram  of  Switch  Module 

The  specific  state  changes  that  an  individual  switch  module  must  make  as  a 
function  of  the  various  input  events  and  the  output  events  that  it  must  generate 
are  summarized  in  the  state  diagram  shown  in  Figure  7. 

2.7  Conflict  Resolution 

The state transitions shown in Figure 7 are predicated on the assumption
that requests to a specific module to establish a path from left to bottom and
from top to bottom do not arrive at the module simultaneously. Notice that
since the control is distributed such a request condition can occur and will
result in a conflict as to which path will be completed. To accommodate these
concurrent requests in a reliable manner, it is necessary to include an
arbitration unit within each of the primitive switching modules. The placement
of this unit is shown in Figure 8. There are three cases in which the behavior
of the module depends upon the order in which the concurrent input signals
arrive. These cases are illustrated in Figure 9. The first case is the situation
described above in which the module is in the inactive state (I) and requests
arrive from the left and top. The results of these concurrent requests are also
illustrated in Figure 9. Suppose a logic one header bit arrives from one source
at the top of a module and from another source at the left of the module
simultaneously. Then the arbitration unit selects one of these requests and
passes that request on. The other request, from S_x say, is not granted and a
NA signal is returned to S_x. Note that in contrast to a common class of
arbitration units, this arbitration unit does not hold the request from S_x for
later action (e.g., for action when the switch path becomes free), but generates
a NA immediately, thereby allowing the source to make its own retry decision.
The second and third cases occur when one path in the module is being cleared
via an NA signal and part of this path is being requested. Figure 9 shows the
two arbitration results. Construction and performance details of such
arbitration units have been described in the literature.


2.8  Retry  Protocol  for  Blocked  Path 

When  a  source  requesting  a  connection  to  a  destination  receives  a  NA  after 
sending  the  one  in  its  header  word,  it  knows  that  the  path  is  blocked.  It  may 
then  try  to  reestablish  the  path  immediately  by  retransmitting  the  header  word, 
or  it  may  employ  some  backoff  algorithm  that  generates  a  random  delay  before  the 
next  retry  is  issued. 
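
The text leaves the backoff algorithm open. One common choice, shown here purely as an illustration, is binary exponential backoff: after the n-th consecutive blocked attempt the source waits a random number of slots drawn uniformly from 0 to 2^n - 1. The slot length and the cap on the exponent are arbitrary assumptions.

```python
import random


def backoff_delay(attempt, slot=1.0, max_exp=10, rng=random):
    """Random retry delay after `attempt` consecutive blocked requests.

    attempt = 0 gives an immediate retry; the range of possible delays
    doubles with each further failure, up to 2**max_exp slots.
    """
    exp = min(attempt, max_exp)
    return slot * rng.randrange(2 ** exp)


rng = random.Random(0)  # seeded only to make the example repeatable
print([backoff_delay(n, rng=rng) for n in range(5)])
```

Randomizing the delay is what keeps two blocked sources from colliding again on every retry.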

2.9  Retry  Protocol  for  Word  Inconsistency 

When a source attempts to establish a path to a destination it should
receive an A signal from each of the Z planes. If the source receives one or more
NA signals it recognizes that the desired pathway is blocked; thus, the other
pathways, even though they have been established, are not useful. In order to
retry the connection the source sends an EOT character over each of the pathways
on which it received an A; this will cause the destination to generate an NA,
thereby clearing them. When the source detects all NA signals from all Z planes,
the path is clear. The return of a mixture of A and NA signals to a source
indicates that another source is requesting the same destination and it must
have also received a mixture. Also, note that this behavior would result in a
deadlock situation in which no retry could be started if the waiting technique
were adopted rather than this path clearing method. However, in the path
clearing method if both sources issue a retry the probability is high that
another blocking will occur. Therefore, backoff techniques (such as used in the
Ethernet) may also be adopted.
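
The source-side decision can be sketched as follows, assuming the source has collected one response per plane. The function name and the returned action labels are hypothetical.

```python
def next_action(responses):
    """Decide what the source does once all Z planes have answered.

    responses: list of 'A' or 'NA', one entry per plane.
    Returns (action, planes_to_clear): planes that answered 'A' on a failed
    attempt hold useless paths and must be released by sending EOT on them.
    """
    if all(r == "A" for r in responses):
        return "transmit", []
    to_clear = [p for p, r in enumerate(responses) if r == "A"]
    return "retry", to_clear


print(next_action(["A", "A", "A", "A"]))
print(next_action(["A", "NA", "A", "A"]))
```

Only after every plane listed in `planes_to_clear` has returned its NA is the whole path clear and a retry safe to issue.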

2.10  Pin  Minimization 

Notice that if the network is arranged in a bit slice the self-timed
switching module requires four pins per port. In contrast, a central control,
although presenting problems such as clock skew, only requires one pin per port.
Thus, the pin efficiency of the self-timed module is very poor. What is the
minimum number of pins that will still maintain the self-timed discipline? We
have found that it is possible to convey the same information across one of the
module sides with three bidirectional paths rather than the four unidirectional
ones. The technique for this encoding is described next.

To guarantee a self-timed discipline we recognize that when a signal
arrives on a line we cannot send the next signal on this same line since two
changes on the same line (without a change on a second line) violate the
self-timed rules. Let us postulate a module that has three bidirectional lines
as shown in Figure 10, and assume that the last change occurred on line Z and
represented an acknowledgment from the destination. Then the next signal will
be from the source and it has two lines (X and Y) on which it may send R^0
or R^1. Suppose we let the source send R^1 by changing line X. When the
destination replies it now has two lines, Y and Z, that it can use to indicate
A or NA. Notice that both the source and destination always have two lines,
other than the line on which the last change occurred, on which they can send
their one bit of information. But the coding, in contrast to the four line
unidirectional module, is not constant. During one transfer a change on X may
represent an A while during the next transfer a change on Y may represent an A.
It is the responsibility of the source and destination to keep track of this
variation in coding in order to determine which line may be changed next.

Although the network could be constructed entirely with these bidirectional line
modules, this complicates the module design. Another approach that we are
exploring is to merely use this technique at the interface between the crossbar
network and the ports. This is illustrated in Figure 11. It does require the
design of a second type of module, but the modules in column 1 and row 1 are
usually special since they must receive and drive signals to the external pins.

This technique reduces the pin requirements for the self-timed module to
three per port.
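
One way to realize the varying three-line code is sketched below. The rule used here, that of the two lines not changed last the alphabetically first carries the first symbol of the sender's alphabet, is an assumed convention chosen so that a source sending R1 after a change on Z toggles line X; the text does not fix the mapping.

```python
LINES = ("X", "Y", "Z")
SOURCE_SYMBOLS = ("R1", "R0")   # what the source can send
DEST_SYMBOLS = ("A", "NA")      # what the destination can reply


def encode(last_changed, symbol, alphabet):
    """Line to toggle so as to send `symbol`, never reusing the last line."""
    free = [line for line in LINES if line != last_changed]
    return free[alphabet.index(symbol)]


def decode(last_changed, changed, alphabet):
    """Symbol conveyed by a change on `changed`, given the previous change."""
    free = [line for line in LINES if line != last_changed]
    return alphabet[free.index(changed)]


last = "Z"                                   # last event: an acknowledgment on Z
req = encode(last, "R1", SOURCE_SYMBOLS)     # source sends a one
ack = encode(req, "A", DEST_SYMBOLS)         # destination acknowledges
nxt = encode(ack, "R0", SOURCE_SYMBOLS)      # next bit uses a different code
print(req, ack, nxt)
```

Both ends must remember the last changed line, since the same physical line can carry a request in one transfer and an acknowledgment in the next.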

2.11  Nearly  Self-Timed  Module

Although we cannot reduce the pin count further and maintain the completely
self-timed nature of the network, it is possible to create a network that has the
minimum number of pins (1 per port) and have local control of the path
establishment and data transfer. This is accomplished by means of a local clock
in each chip. This local clock is distributed only through the chip (there is no
clock signal or relative timing requirements between chips) and the
communication between modules on a chip or between chips is via the standard
UART type of serial data transmission. A module of this type is illustrated in
Figure 12. Here the interconnection pathways are bidirectional and module
communication is via serially encoded characters. After one character is sent
the destination module acknowledges via the return of a serially encoded
character thus providing handshaking. This is a very simple yet elegant
technique and we are examining it further.


3.0  PROBABILITY  OF  RETRY 

In Section 2.0 a scheme was presented which permitted use of a distributed
routing control, bit-sliced architecture for interconnection networks.
As indicated, propagation delay differences may occur across the bit slices
resulting in different interconnections being established on the various
network planes. While such inconsistencies can be detected and corrected through
retry procedures, it is important to quantify how often such retries are
encountered so that performance degradation can be evaluated. Therefore, the
goal of the analysis presented here will be to evaluate the probability that
an inconsistent word will develop for a source S_i upon requesting a
destination D_k, given the statistical properties of the requests from the N'
sources and a simple propagation delay model for the interconnection network.
For the purpose of this analysis, we will assume that the request process for
each source has a Poisson distribution with rate \lambda, and that these
processes are independent. We shall also assume that the requests are uniformly
distributed across the destinations.
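
Under these assumptions a rough feel for the exposure can be had from elementary Poisson properties. The sketch below is not the expression derived in this paper; it only computes the probability that at least one of the other sources issues a request for the same destination within a time window of a given width. The window width and the function name are illustrative assumptions.

```python
import math


def p_competing_request(n_sources, rate, window):
    """P(some other source requests the same destination within `window`).

    Each of the other N' - 1 sources requests at the given Poisson rate and
    picks the same destination with probability 1/N', so competing requests
    form a Poisson process of rate (N' - 1) * rate / N'.
    """
    competing_rate = (n_sources - 1) * rate / n_sources
    return 1.0 - math.exp(-competing_rate * window)


print(p_competing_request(512, 0.01, 0.5))
```

The result is small whenever the window is short compared with the mean time between requests, which is the qualitative behaviour one would expect of any bound of this kind.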

In the next section, a simple example is provided to illustrate the effect
of propagation delay variations across the planes of a bit partitioned
network. Following this, an expression is derived for the probability of an
inconsistent word. Then a model is presented to account for propagation delay
variability. In the last section, the probability of an inconsistent word is
evaluated as a function of the number of network ports, bit plane partitions,
and request arrival rate for a pipelined crossbar structure.

3.1  Illustration  of  the  Effects  of  Path  Delay  Variability

There are two factors that influence the probability that an inconsistent
word will develop: the variability in propagation delay of an interconnection
path from plane to plane, and the probability that an arbitration module will
not award the path to the first request. For large networks this latter factor
can be shown to have negligible effect on the probability of retry and it is
reasonable to assume that one has a perfect arbitration module. To illustrate
the effect of the variability of the path propagation delay across the planes
consider the following example:

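Let $S_i$ and $S_j$ make simultaneous requests for a path to $D_k$. Let the
delay $\delta_i^p$ along the path from $S_i$ to the arbiter at row $j$,
column $k$ in plane $p$ be a constant for all planes, that is:

$$\delta_i^p = K_i \quad \text{for } p = 1, 2, \ldots, Z$$

Let the delay $\delta_j^p$ along the path from $S_j$ to the same arbiter be a
constant for all but one of the planes, that is:

$$\delta_j^p = \begin{cases} K_j & \text{for } p = 1, 2, \ldots, Z-1 \\ K_j + \Delta & \text{for } p = Z \end{cases}$$

where $\Delta \ge 0$.

The received message then can be expressed as

$$D_k = \begin{cases} \{S_j^1 \ldots S_j^Z\} & \text{for } K_i > K_j + \Delta \\ \{S_i^1 \ldots S_i^Z\} & \text{for } K_i < K_j \\ \{S_j^1 \ldots S_j^{Z-1}\, S_i^Z\} & \text{otherwise} \end{cases}$$
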

Observe that for a range of propagation delays there will always be an
inconsistent connection. If one wanted to tailor this set of paths from $S_i$
to $D_k$ and from $S_j$ to $D_k$ so that such an inconsistency could not occur,
independent of the time separation between the source requests, it could only
be accomplished by constructing each of the $Z$ paths associated with $S_i$ to
have a constant delay, $K_i$, and each of the $Z$ paths associated with $S_j$
to have a constant delay, $K_j$. Obviously, this is not possible. Therefore, it
is important to quantify how often such inconsistencies will arise.
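
The example can be traced with a few lines of code. The following is only an
illustrative sketch: the function names and the tie rule (a tie is awarded to
$S_j$) are assumptions made here, not part of the report, and the numeric
delays are arbitrary.

```python
def received_word(k_i, k_j, delta, z):
    """Return, per plane, which source ('i' or 'j') wins the arbiter.

    Both sources request at the same instant, so the smaller path delay
    arrives first and wins. Ties go to 'j'; this tie rule is an assumption
    made only for this illustration.
    """
    winners = []
    for p in range(1, z + 1):
        delay_i = k_i                               # constant on every plane
        delay_j = k_j + delta if p == z else k_j    # plane z is slower by delta
        winners.append('i' if delay_i < delay_j else 'j')
    return winners


def is_inconsistent(winners):
    """A word is inconsistent when the planes are split between sources."""
    return len(set(winners)) > 1


if __name__ == "__main__":
    # Arbitrary delays chosen to hit each of the three cases.
    for k_i in (8.0, 11.0, 14.0):
        w = received_word(k_i, k_j=10.0, delta=2.0, z=4)
        print(k_i, w, is_inconsistent(w))
```

Only the middle setting, where $K_i$ lies between $K_j$ and $K_j + \Delta$,
produces a split word.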


3.2  Probability  of  an  Inconsistent  Word 

The goal of this section is to derive an expression for the probability
that an inconsistent word will develop when a source makes a request for a
destination. Due to the combinatorial complexity of this problem (especially
when $N'$ is large), we will not attempt to obtain an exact solution. Indeed,
our overall goal is simply to show that the probability of a word inconsistency
is small under typical operating conditions. To this end, we will only
determine an expression which represents an upper bound for this probability.
Several models have been investigated and differ only in the extent to which
approximations are used, and hence, in the tightness of their bounds. The
highest level model, to be described here, provides the loosest bound, but is
(to some extent) the asymptote that the other models approach as the number
of pathwidth partitions becomes large (e.g. $Z \ge 16$).

3.2.1  Definition  of  PIW 

Let us formally define what is meant by "the development of an inconsistent
word". We distinguish three possible outcomes which can occur when a source
$S_i$ requests a destination $D_k$: $S_i$ may capture 1) all, 2) some, or
3) none of the network path partitions required for communication with $D_k$.
In the event that only some of the partitions are captured, we say an
inconsistent word has developed. Thus we have:

PIW = PROB{$S_i$ must retry due to an inconsistent word}

    = PROB{$S_i$ captures some (but not all) of the network planes}

Since these three outcomes are disjoint and describe the entire outcome
space, for computational purposes we may evaluate PIW in terms of the
other two events. Therefore, let:

PIW = 1 - [PROB{$S_i$ captures all of the planes}

          + PROB{$S_i$ captures none of the planes}]

For the sake of a shorter notation, this becomes:

PIW = 1 - [P{ALL} + P{NONE}]


In the next section, we investigate how P{ALL} is determined.

3.2.2 Capture Process for Source $S_i$

To understand what must occur if $S_i$ is to capture all of the partitions
(planes), we refer to Figure 3. $S_i$ must capture all of the switching modules
in the dotted path to $D_k$, and it must do this for all of the $Z$ planes.
Since the horizontal path through a module is captured without contention, the
only switch modules that might experience contention are those in the
destination column. At each of these column modules, the request from $S_i$
will compete for temporary use of the switch with the module's other request
line. Let us identify a module by its row (R) and column (C), e.g. MOD<R,C>.
Then, for MOD<i,k>, this other request can come from any one of the sources
"above" $S_i$, i.e. any $S_j$ for which $j > i$. All of these other sources
will, so to speak, "fight it out" to see which source request gets to compete
for this module. For MOD<j,k> where $j < i$, this request can only come from
one source, $S_j$.


Before we can determine the probability that $S_i$ will capture an arbitrary
switch module, we must introduce the concept of an "uncertainty interval". This
is done in the next section.


3.2.3  Definition  of  the  "Uncertainty  Interval" 

If two sources $S_i$ and $S_j$, $j < i$, request the same destination, there are
two  factors  that  determine  which  of  the  sources  captures  the  required  path  on 
any  given  plane.  These  are: 

1)  The  relative  times  that  the  requests  enter  the  network 

2)  The  path  delays  between  the  network  inputs  and  the  arbitrating 
switch  module  (e.g.  MOD  <j,k>) 

If both of these factors are known, we can determine which source obtains the
path. If only the path delays are known, we can specify the request times of
$S_j$ (relative to the request time of $S_i$) which result in $S_j$ capturing
the path and those times which result in $S_i$ capturing the path. This is
shown in Figure 13, where $t_i$ represents the request time of $S_i$. There is
ideally only one request time for $S_j$ which results in an unpredictable
outcome, although even this time is eliminated under the assumption of perfect
arbiters. This unpredictable time is marked in the figure.

Let us now presume that there is some uncertainty in the path delays to
MOD<j,k>. For instance, let the path delay for $S_j$ vary between
$\delta_j(\min)$ and $\delta_j(\max)$. If $t_j$ represents the request time of
$S_j$, then this request will arrive at the arbiter sometime between
$t_j + \delta_j(\min)$ and $t_j + \delta_j(\max)$. We shall refer to this as
$S_j$'s "arrival interval". Similarly, $S_i$'s "arrival interval" will be
between $t_i + \delta_i(\min)$ and $t_i + \delta_i(\max)$. These are shown in
Figure 14. From the figure, we see that as long as $t_j$ occurs before the time
labelled $t_A$ or after the time labelled $t_B$, the outcome as to which source
captures the path is predictable: if $t_j$ occurs before $t_A$, then $S_j$
captures the path; if $t_j$ occurs after $t_B$, then $S_i$ captures the path.
If, however, $t_j$ occurs between $t_A$ and $t_B$, the "arrival intervals" for
the two sources will overlap, and the outcome will be uncertain. For this
reason, the set of these request times will be referred to as "$S_j$'s
uncertainty interval with respect to $S_i$".

Although we can directly determine the values of $t_A$ and $t_B$ in terms of
the other parameters, our interest only lies in the length of this interval.
From the figure, it is clear that the length, designated as $\Delta T_j$, is
equal to the sum of the lengths of the arrival intervals.
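
The endpoints and length of the uncertainty interval can be sketched as
follows. The endpoint expressions are worked out here from the arrival-interval
definitions rather than quoted from the report, and the function and parameter
names are hypothetical.

```python
def uncertainty_interval(t_i, d_i_min, d_i_max, d_j_min, d_j_max):
    """Return (t_a, t_b, length) for S_j's uncertainty interval w.r.t. S_i.

    t_a: request times of S_j earlier than this guarantee that S_j arrives
         first, i.e. t_j + d_j_max < t_i + d_i_min.
    t_b: request times of S_j later than this guarantee that S_i arrives
         first, i.e. t_j + d_j_min > t_i + d_i_max.
    """
    t_a = t_i + d_i_min - d_j_max
    t_b = t_i + d_i_max - d_j_min
    return t_a, t_b, t_b - t_a


if __name__ == "__main__":
    # Arbitrary delays in nsec: S_i in [20, 40], S_j in [45, 75].
    print(uncertainty_interval(0.0, 20.0, 40.0, 45.0, 75.0))
```

The returned length equals the sum of the two arrival-interval lengths,
independent of $t_i$, which agrees with the statement above.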

3.2.4  Evaluation  of  P{ALL} 

Let us now determine the probability that $S_i$ captures MOD<j,k> (for
arbitrary $j$ less than $i$) on all of the planes, given that the request
reaches each module. If $S_j$ does not make a request for $D_k$ during its
"uncertainty interval with respect to $S_i$", then $S_i$ captures the module
outright (i.e. with probability = 1) on all of the planes*. If, however, $S_j$
does request $D_k$ during the uncertainty interval, then the probability that
$S_i$ captures the module for any individual plane is $p(j)$, where $p(j)$ will
be determined by the type of request process assumed and the delay
characteristics of the network. Thus, the probability in this case that $S_i$
captures the module on all of the $Z$ planes is $p(j)^Z$. Therefore, the
probability that $S_i$ captures MOD<j,k> on all of the network planes is equal
to:

$$P\{S_j \text{ doesn't req. } D_k \text{ during its uncertainty interval}\} \cdot 1 + P\{S_j \text{ does req. } D_k \text{ during its uncertainty interval}\} \cdot p(j)^Z$$

* Recall that we are working under the assumption that the destination
is idle. This eliminates the possibility that $S_j$ makes a request before
the uncertainty interval.

Let us now determine the probability that $S_i$ captures MOD<i,k> on
all of the planes, given that the request reaches each module. If none of
the sources above $S_i$ request $D_k$ during their uncertainty intervals (each
with respect to $S_i$), then $S_i$ captures the module outright for all of the
planes. Since these sources act independently, the probability that this occurs
is just the product of the probabilities that each source has of not requesting
$D_k$ during its uncertainty interval. If one (or more) of the sources does
request $D_k$ (during its uncertainty interval), then the probability that
$S_i$ captures the module for any individual plane is $p(j)$, where $j$
corresponds with the index of the source whose request reached the module. Thus
the probability in this case that $S_i$ captures the module on all of the $Z$
planes is $p(j)^Z$. Therefore, the probability that $S_i$ captures MOD<i,k> on
all of the network planes is equal to:

$$\prod_{j=i+1}^{N'} P\{S_j \text{ doesn't req. } D_k \text{ during UI}\} + \left[ 1 - \prod_{j=i+1}^{N'} P\{S_j \text{ doesn't req. } D_k \text{ during UI}\} \right] \cdot p(j)^Z$$

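
The two module-level expressions can be written as small functions. This is a
sketch only: the names are hypothetical, and for MOD<i,k> a single value of
$p(j)$ is used for whichever source wins through, which is a simplification of
the expression above.

```python
from math import prod


def capture_lower_module(q_j, p_j, z):
    """P{S_i captures MOD<j,k> on all Z planes}, for j < i.

    q_j: probability that S_j does not request D_k during its
         uncertainty interval
    p_j: per-plane probability that S_i still wins when S_j does request
    """
    return q_j * 1.0 + (1.0 - q_j) * p_j ** z


def capture_own_module(q_above, p_j, z):
    """P{S_i captures MOD<i,k> on all Z planes}.

    q_above: iterable of q_j for the sources j = i+1 .. N' above S_i
    p_j: per-plane win probability against the competing request
         (one shared value; a simplification made for this sketch)
    """
    none_request = prod(q_above)
    return none_request + (1.0 - none_request) * p_j ** z
```

For example, with q_j = 0.99, p_j = 0.5 and Z = 16, the second term of
`capture_lower_module` is already negligible next to the first.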

We now evaluate the probability that an arbitrary source $S_j$ does not
make a request for destination $D_k$ during its uncertainty interval with
respect to source $S_i$, recalling that we have assumed independent Poisson
request processes with rate $\lambda$ for all of the sources. It is easily
shown that the probability that $S_j$ does not make a request for $D_k$ during
its uncertainty interval is related to the probability that $S_j$ does not make
a request at all during its uncertainty interval as follows:



Finally, the probability that $S_i$ captures the path to $D_k$ on all of
the network planes (i.e. P{ALL}) equals the product (over $j$ = 1 to $i$) of
the probabilities that $S_i$ captures switch module MOD<j,k> on all of the
planes, given that the request arrived at each module, i.e.

$$P\{\mathrm{ALL}\} = \prod_{j=1}^{i} P\{S_i \text{ captures MOD<j,k> on all planes given } S_i \text{ arrives at each module}\}$$
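
A minimal sketch of this product is given below. The data layout
(dictionaries keyed by source index) and the function name are assumptions made
for illustration. For MOD<i,k> the smallest $p(j)$ among the sources above
$S_i$ is used, so the sketch cannot overstate P{ALL}.

```python
from math import prod


def p_all(i, n_prime, q, p, z):
    """Probability that S_i captures every plane on the path to D_k.

    q: dict, source index j -> P{S_j does not request D_k during its UI}
    p: dict, source index j -> per-plane win probability p(j)
    Sources are numbered 1..n_prime; S_i itself needs no entry.
    """
    total = 1.0
    # Modules MOD<j,k> for j < i: one competing source S_j at each.
    for j in range(1, i):
        total *= q[j] + (1.0 - q[j]) * p[j] ** z
    # Module MOD<i,k>: competition from the sources above S_i (j > i).
    above = range(i + 1, n_prime + 1)
    none_request = prod(q[j] for j in above)
    # Conservative simplification: take the smallest p(j) among them.
    p_above = min((p[j] for j in above), default=1.0)
    total *= none_request + (1.0 - none_request) * p_above ** z
    return total
```

With every q equal to 0.9 and every p equal to 0.5, three competing sources
give 0.95 per module for Z = 1, and the result tends to 0.9 per module as Z
grows.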

3.2.5  Evaluation  of  P{NONE} 

Let us now consider P{NONE}. Due to the combinatorial complexity of
this probability, an analysis similar to that of P{ALL} cannot be done.
Therefore we shall simply attempt to show that this term is no larger than
P{ALL}, and in fact, is significantly smaller than P{ALL} when the number
of network partitions is large.

Consider first the probability that $S_i$ does not capture an arbitrary
plane P:

$$P\{S_i \text{ doesn't capture arbitrary plane P}\} = 1 - P\{S_i \text{ does capture arbitrary plane P}\} = 1 - P\{\mathrm{ALL}\}\big|_{Z=1}$$

Now, if all of the planes operated in lock step, i.e. were totally dependent,
then:

$$P\{\mathrm{NONE}\} = P\{S_i \text{ doesn't capture arb. plane P}\} = 1 - P\{\mathrm{ALL}\}\big|_{Z=1}$$

If, on the other hand, all of the planes were totally independent, then:

$$P\{\mathrm{NONE}\} = \left[ P\{S_i \text{ doesn't capture arb. plane P}\} \right]^Z = \left[ 1 - P\{\mathrm{ALL}\}\big|_{Z=1} \right]^Z$$

Thus:

$$\left[ 1 - P\{\mathrm{ALL}\}\big|_{Z=1} \right]^Z \le P\{\mathrm{NONE}\} \le \left[ 1 - P\{\mathrm{ALL}\}\big|_{Z=1} \right]$$

Hence, as long as P{ALL} is greater than 0.5, P{NONE} will be smaller than
P{ALL}. Since in actuality there is only a slight dependence between planes,
P{NONE} tends toward the left hand side of this inequality, which implies that
P{NONE} will approach zero as the number of planes increases. For this reason,
we will neglect this term in the evaluation of PIW when $Z$ is large.
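
The two bounds are easy to evaluate numerically. The sketch below uses a
hypothetical function name and an arbitrary single-plane capture probability
to show how quickly the independent-planes bound collapses as $Z$ grows.

```python
def p_none_bounds(p_all_single_plane, z):
    """Return (lower, upper) bounds on P{NONE} for Z planes.

    lower: planes totally independent -> (1 - P{ALL}|Z=1) ** Z
    upper: planes in lock step        -> (1 - P{ALL}|Z=1)
    """
    miss = 1.0 - p_all_single_plane
    return miss ** z, miss


if __name__ == "__main__":
    # 0.9 is an arbitrary illustrative value for P{ALL} with Z = 1.
    for z in (1, 4, 16):
        print(z, p_none_bounds(0.9, z))
```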

3.2.6  An  Approximation  for  PIW  Given  Z  is  Large 

A large reduction in the computation and complexity of PIW can be made
when $Z$ is large (e.g. $Z \ge 16$). For this case, we can assume that
$p(j)^Z \approx 0$. Substituting this approximation into the equations of
Section 3.2.4, we get the following result:

$$\mathrm{PIW} \approx 1 - P\{\mathrm{ALL}\} = 1 - \prod_{\substack{j=1 \\ j \ne i}}^{N'} P\{S_j \text{ doesn't req. } D_k \text{ during UI}\}$$

That is, P{ALL} is simply the probability that none of the other sources
request $D_k$ during their uncertainty intervals. Since the sources are
independent, this is just the product of the probabilities that each source
has of not requesting $D_k$ during UI.
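
In code, the large-$Z$ approximation reduces to a single product. The function
name and the list-of-probabilities input are assumptions made for this sketch.

```python
from math import prod


def piw_large_z(q_others):
    """Large-Z approximation: PIW ~= 1 - product of q_j over all j != i.

    q_others: iterable of P{S_j does not request D_k during its UI},
              one entry per source other than S_i.
    """
    return 1.0 - prod(q_others)
```

With no competing sources the product is empty and the approximation returns
zero, as expected.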

3.3  Propagation  Delay  Model 

Here we present a simple model for propagation delay based on the physical
characteristics of the network. As a signal propagates along one of
the network paths, it will experience two types of delay: switch module
delay and interchip path delay. (The intrachip path delay is very small due
to the crossbar structure and can be included in the switch module delay.)
Franklin and Wann, in (&), have derived equations for these two delays.
If we use typical VLSI values in these expressions and assume that a switching
module has three or four levels of "steering" logic, then both of these delays
are on the order of 10 nsec. However, due to stray capacitances, misaligned
masks, etc., there will be some variability in these delays. Therefore,
to model these, we will consider both the interchip and module delays to
be normal random variables, with means of 10 nsec. If we assume that the
actual delays will be within fifty percent of the mean (i.e. within 5 nsec)
95% of the time, and take this range as two standard deviations, then the
corresponding standard deviation is 2.5 nsec. We shall also assume, for the
purpose of simplification, that these delays are all independent of each other.
In general, this may not be true.
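
The standard deviation quoted above follows from the two-sigma rule of thumb
(5 nsec / 2 = 2.5 nsec). As a check, the sketch below computes the value
implied by the exact 95% Normal quantile, which is slightly larger. The
function name is hypothetical.

```python
from statistics import NormalDist


def sigma_from_tolerance(mean_ns, fraction, coverage=0.95):
    """Standard deviation such that `coverage` of a Normal distribution
    lies within mean +/- fraction * mean."""
    half_width = fraction * mean_ns
    # Two-sided quantile: about 1.96 standard deviations for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return half_width / z


if __name__ == "__main__":
    print(sigma_from_tolerance(10.0, 0.5))   # exact-quantile value
    print(0.5 * 10.0 / 2.0)                  # two-sigma rule used in the text
```

The difference between the two values is about two percent, so the rounded
figure of 2.5 nsec is adequate for the bound derived here.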

3.4 PIW for a Pipelined Crossbar Network: An Example

Let us now apply the equations that we have derived, together with
the propagation delay model, to determine the probability that an inconsistent
word will develop for source $S_i$ upon requesting destination $D_k$
(given that $D_k$ is idle). Our initial research efforts in this area have been
directed toward a pipelined network rather than the circuit switched network
that was described in section 2.0. For this reason, the example to
be presented here will assume a pipelined approach.

The main difference between the pipelined and circuit switched approaches
is that, in the pipelined model, each module has memory, and data is passed
locally using a handshake between modules rather than globally using a
handshake which always involves the source. Path establishment is done in a
pipelined fashion as well. The source, handshaking with the first module,
simply pushes the routing header into the pipe, followed by data (unless it
receives a blocked signal). Therefore, in this approach, the source's request
will reach the first arbiter (MOD<i,k>) after $k-1$ module delays and
|^| interchip path delays. Neglecting the interchip path delays and applying
our propagation model for the switch module delays, the time it takes the
request to reach the arbiter is then the sum of $k-1$ Normal random variables,
each having a mean of 10 nsec and standard deviation of 2.5 nsec. Hence,
the arrival time at the arbiter will be Normally distributed with mean
$10(k-1)$ nsec and standard deviation $2.5\sqrt{k-1}$ nsec.

As  a  first  order  approximation,  this  path  delay  can  be  specified 
as  a  Uniform  random  variable  having  a  minimum  and  maximum  value  which 
correspond  with  the  95%  confidence  interval  cutoff  points  of  the  Normal 
delay  distribution. 


That is, let

$$\delta(\min) = 10(k-1) - 2(2.5)\sqrt{k-1} \ \text{nsec}$$

$$\delta(\max) = 10(k-1) + 2(2.5)\sqrt{k-1} \ \text{nsec}$$

Hence, the "arrival interval" for this source at MOD<i,k> is
$4(2.5)\sqrt{k-1}$ nsec long.
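
The cutoff points can be computed directly. In this sketch the constant and
function names are hypothetical; the numeric values are the 10 nsec mean and
2.5 nsec standard deviation of the delay model.

```python
from math import sqrt

MODULE_MEAN_NS = 10.0   # mean switch module delay (from the delay model)
MODULE_SD_NS = 2.5      # standard deviation of a module delay


def arrival_interval(n_modules):
    """Return (delta_min, delta_max, length) in nsec for a path made of
    `n_modules` independent module delays, using +/- 2 sigma cutoffs."""
    mean = MODULE_MEAN_NS * n_modules
    sd = MODULE_SD_NS * sqrt(n_modules)   # variances add, so sd grows as sqrt
    return mean - 2.0 * sd, mean + 2.0 * sd, 4.0 * sd


if __name__ == "__main__":
    k = 17                            # arbitrary destination column
    print(arrival_interval(k - 1))    # request crosses k-1 modules
```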

We can make a similar analysis to the one above for each of the other
sources to acquire their arrival intervals at the various arbiters. We can
then use these to determine each source's uncertainty interval. In general,
this will be:

$$\Delta T_j = 4(2.5)\left( \sqrt{k-1} + \sqrt{k-1+|j-i|} \right) \ \text{nsec, for all } j.$$


We can then apply this to the equations derived in section 3.2 to obtain the
probability of an inconsistent word (for source $S_i$ requesting $D_k$) as a
function of the number of network ports, bit plane partitions, and the request
arrival rate, $\lambda$. Figure 15 shows the curves obtained when source $N'$
requests destination $N'$, which yields the highest probability of
inconsistency for the network. Note that, for the arrival rates shown, the
probability of retry is relatively small.
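
A rough numerical version of this calculation, for the large-$Z$ case, might
look as follows. It is a sketch under two assumptions that are not taken from
the report: requests for a given destination are modeled as a thinned Poisson
process of rate $\lambda / N'$, and the request rate used in the loop is an
arbitrary illustrative value. It is not intended to reproduce Figure 15.

```python
from math import exp, sqrt


def uncertainty_length_ns(i, j, k):
    """Delta T_j = 4(2.5)(sqrt(k-1) + sqrt(k-1+|j-i|)) in nsec."""
    return 4.0 * 2.5 * (sqrt(k - 1) + sqrt(k - 1 + abs(j - i)))


def piw_large_z(i, k, n_prime, rate_per_ns):
    """Large-Z approximation of PIW for S_i requesting D_k.

    Assumption (not from the report): S_j's requests for D_k form a Poisson
    process of rate rate_per_ns / n_prime, so that
    P{no request for D_k during Delta T_j} = exp(-rate * Delta T_j / n_prime).
    """
    exponent = sum(
        rate_per_ns * uncertainty_length_ns(i, j, k) / n_prime
        for j in range(1, n_prime + 1)
        if j != i
    )
    # Product of exponentials equals the exponential of the summed exponents.
    return 1.0 - exp(-exponent)


if __name__ == "__main__":
    # Worst case named in the text: source N' requesting destination N'.
    # 1e-4 requests per nsec is a hypothetical rate chosen for illustration.
    for n_prime in (16, 64, 256):
        print(n_prime, piw_large_z(n_prime, n_prime, n_prime, rate_per_ns=1e-4))
```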
FIGURE 4. CONNECTIONS FOR $S_i$ AND $S_j$ REQUESTING $D_k$ WITH
INCONSISTENCY IN 14th PLANE (B* => 16, B = 1)


FIGURE 5. DELAY INSENSITIVE SWITCH MODULE
(SUBSCRIPTS: L-LEFT, R-RIGHT, T-TOP,


FIGURE 6. THE FIVE POSSIBLE DATA


FIGURE 8. PLACEMENT OF ARBITRATION UNIT


FIGURE 9. CONFLICT SITUATIONS
AND POSSIBLE RESULTS


FIGURE 12. BIDIRECTIONAL SINGLE LINE
SWITCHING MODULE (NOT SELF-TIMED)


FIGURE 13. PATH DELAYS KNOWN (FOR PLANE P)


FIGURE  14.  PATH  DELAYS  UNCERTAIN  (FOR  PLANE  P) 


FIGURE 15. PROBABILITY PIW THAT A WORD SENT BECOMES INCONSISTENT AND A RETRY IS
NECESSARY AS A FUNCTION OF THE NETWORK SIZE N', BIT PLANE PARTITIONS Z, AND
SOURCE REQUEST RATE λ. SWITCH MODULE DELAY IS NORMALLY DISTRIBUTED WITH A
MEAN OF 10 NSECS AND A 2.5 NSEC STANDARD DEVIATION. WORST CASE CONDITION IS
USED WITH A SOURCE AT UPPER LEFT CORNER AND DESTINATION AT LOWER RIGHT CORNER
OF INTERCONNECTION NETWORK.


1.0 Research Review

Our  research  activities  at  Washington  University  (St.  Louis)  are 
centered  on  the  analysis  and  synthesis  of  multiprocessor  systems  and 
associated  communication  networks.  ONR  support  is  principally  in  the 
communications  network  area. 

National interest in tightly coupled MIMD (Multiple Instruction, Multiple
Data stream) multiprocessor computer systems has increased greatly during the
latter part of the 1970s. This has been due to 1) the enhanced performance
possibilities for such systems (e.g., increased computational power and high
reliability), 2) the steady decrease in hardware costs associated with these
systems, and 3) the realization that the assembly of modular systems for the
solution of computationally intensive problems is now feasible. This
realization has resulted in the initiation of a number of multiprocessor
projects around the country, several of which will be mentioned later in this
discussion. Overviews of certain of the benefits (as well as limitations) of
such systems have been presented by Enslow (1977) and Kuck (1977).

There are many technical issues involved in the design of these
multiprocessor systems. One of the major constraints is that imposed by the
network structure over which the multiple processors communicate. This is
not to minimize the unique programming tasks required for such systems, but
the programming is impacted less by technology and does not appreciably change
when small alterations are made, for example, in the number of processors
(e.g., from 10 to 40). In contrast, this type of change could significantly
influence the design and performance of the communication network, and thus
the performance of the overall system. As a consequence, the characteristics
of this communication network become the critical factor in establishing both
performance and cost. This has led to a variety of studies aimed at
characterizing and quantifying the performance of such networks. The figure of
merit of the system is determined, to a great extent, by the average bandwidth
that is achieved between processors. Since this average bandwidth is primarily
established by the architectural organization of the communication network,
various structural configurations have been investigated and compared; for
example, see Anderson et al. (1975), Thurber (1974), and Siegel et al. (1979A).
The principal parameter used in these studies has been the effect that the
number of switches (i.e., complexity) has on the bandwidth of the path between
processors, or between processors and shared memories. Tradeoffs between
bandwidth, switches, and path blocking have been presented in these and other
papers. Major efforts in the area of network characterization and functional
analysis are underway at the University of Illinois under Lawrie (1975), at
Purdue University under Siegel (1979B), and at the University of Texas at
Austin under Lipovski (1979).
Implementation efforts in the multiprocessor area are being undertaken
in a number of places. Carnegie-Mellon University has had an operational
multiple processor system, called C.mmp (Wulf and Bell, 1972), for several
years now. Processors are connected by use of a very complex crossbar switch
which lacks modularity and expandability properties. Currently their efforts
are centering on the implementation of a tree structured multiprocessor called
Cm*. Another tree structured multiprocessor is being designed by Keller
(1979) at the University of Utah with support from NSF. At Purdue, Siegel
has been working, with Air Force sponsorship, on parallel processors for use
in image processing. Widdoes (1980) has been directing a large multiprocessor
project at the Lawrence Livermore Laboratory under the sponsorship of the
U.S. Navy. This multiprocessor uses a crossbar switch to connect up to
sixteen computers, each of which is itself of supercomputer power. The
System Development Corporation at Huntsville, under contract to the Air Force,
is investigating the design of a large multiprocessor for use in ABM
applications. Dennis (1974) at MIT has been studying the design of
multiprocessors of the "data flow" variety, and an experimental machine has
been implemented by Texas Instruments. A similar approach is being studied by
Sullivan and Bashkow (1977) at Columbia University. Processing elements in this
case are connected in a Binary-K cube arrangement. Bolt, Beranek and Newman
(BBN) is currently implementing a multiprocessor for ARPA. This system uses an
indirect binary n-cube communications network and utilizes M68000
microprocessors (up to several hundred) as the computational element in a
shared memory environment. At the University of Texas (Austin) implementation
is proceeding on a "Reconfigurable Computer" (Sejnowski, 1980) which will
eventually contain 16 bit slice processors of the 2900 variety, and 81 memories
connected through a Banyan network. This work is being supported by NSF. An
interesting commercial multiprocessor that recently became available is the HEP
(Heterogeneous Element Processor), which is being designed and implemented by
the Denelcor Corporation of Denver under the auspices of the U.S. Army.

Finally, at Washington University a multiprocessor is being designed and
implemented. The processor building blocks are microprocessors (DEC LSI-11s),
while the connection network is a modular, pipelined crossbar (Franklin, 1979)
which has been designed to have certain features which make it amenable to
VLSI implementation. The multiprocessor systems work has been supported
by NSF, while aspects of the communications network are being
supported by ONR.

All of these efforts reflect the general view that one important way of
achieving computer power is through parallelism. In toto they represent, at
the national level, a series of experiments, pursued by independent and
competing groups, aimed at evaluating the potentials and problems of multiple
processor systems. As indicated, one important aspect of these experiments
is the interprocessor communications network used. While this is of critical
importance, most efforts have concentrated on using relatively straightforward
structures implemented in SSI and MSI technologies. This is acceptable
as long as the number of processors in the system is limited, and little or no
expansion is foreseen. This is the case with most of the multiprocessor
experiments mentioned above. When one gets to systems where thousands of
processors may be present, the communications network, both in terms of its
functionality and its implementation details, becomes critical. In almost all
designs for closely coupled multiprocessor systems, network complexity grows
faster than processor complexity (i.e., number of processors). This fact, plus
our experience in multiprocessor design, has reinforced our views on the
importance of VLSI in achieving physically small, reliable, yet powerful
switching networks. VLSI technology has the potential for economically placing
large parts of a switching network for a multiprocessor system on a single
chip. Cost here becomes related to chip area. Unfortunately, a new challenge
appears: the implementation of the connection paths may use substantial amounts
of the chip area, thus limiting the area available to the switch elements
themselves. This has the effect of reducing the size of a switching network
that can be fabricated on a chip of a given size. The time delay associated
with the connection paths also contributes to the overall delay, thus
directly affecting bandwidth. Area, topology and layout, basically ignored
in traditional communication network analysis, become important interrelated
factors in VLSI network design. Some of these major issues have been discussed
by Franklin (1980), and this is the research currently being supported by ONR.

We are not aware of any other effort in the United States in which the
VLSI modular approach to communications networks (including actual
implementation issues) is being considered, and that is principally what makes
the effort at Washington University unique. Perhaps the most analogous work is
that started by Kung et al. (1979) and his student Thompson (1979), in which
they are exploring how certain conventional algorithms (such as matrix
multiplication, tree searching and shuffle-exchange) can be efficiently mapped
into VLSI. However, they are not explicitly investigating communication network
problems.

Thus one of our major goals is to be able to provide both a theoretical
and practical analysis of various communication network configurations in the
VLSI domain, and hopefully to make recommendations regarding a modular
approach to their design. If we are successful, this work could form the basis
for the widespread exploitation of the computational power of multiprocessor
systems.


2.0  "Cutting  Edges"  in  Multiprocessor  Communications  Network  Research 


The questions related to VLSI implementation of communications networks
will remain a critically important research area for some time to come. There
are a variety of design options available including: a) network topology
(e.g., crossbar, Banyan, binary K-cube, etc.), b) communications style (e.g.,
circuit switching, packet switching, pipelining), c) general organization
(e.g., number of data/control lines, synchronous/asynchronous communications,
pin usages and limitations, etc.). It will likely take some time before the
most rewarding of these approaches is determined. Other research activities
will be stimulated by this work, and a broader view of these problems is
presented in section 3.0.

A central issue for some time will be the problem of pin limitations. To
accommodate this constraint, while exploiting the large component densities
available in VLSI, will require further investigations in multiple computer
architectures. Architecture explorations which attempt to merge processing
and network elements on the same chip will be pursued, and just what the
communications requirements are for different problem classes (Franklin, 1978)
will be explored in depth.

This leads directly to the question of how one optimally can exploit
parallelism in different applications areas. Communications networks which
are tailored to application problem solution algorithms represent an important
research area whose full potential is not known. In this area, "cutting edge"
research will attempt to define how one effectively exploits VLSI technology
in the context of multiprocessor architectures, particular application areas,
and parallel solution algorithms.


3.0 Future Research Opportunities

While research into particular and specific problems in Electronic
Systems Theory will continue to be fruitful (e.g., reliability theory,
differential games, etc.), it appears that two interrelated general research
themes will be of increasing importance in the future. The first concerns
the design of high performance digital systems, while the second concerns
the problem of designing and managing systems of rapidly growing complexity.

In  the  past  the  armed  forces  have  been  able  to  make  effective  use  of 
digital  computers  and  digital  systems  which  have  been  developed  and  marketed 
primarily  for  nonmilitary  purposes.  General  purpose  computers,  and  more 
recently  various  popular  microprocessors  have  had  a  sizeable  role  in  many 
military  systems.  Increased  system  performance  has  stemmed  in  large  part 
from  developments  in  basic  technology  which  have  resulted  in  increased  device
performance.  While  this  trend  will  continue,  it  is  clear  that  many  more  complex
architectural  and  design  options  which  are  potentially  rewarding  to  the 
Navy  will  not  be  explored  or  developed  for  the  civilian  commercial  market 
due  to  their  cost,  ongoing  commercial  compatibility  requirements,  and  the 
specialized  performance  and  environmental  needs  of  the  military.  Of  course,
this  has  always  been  true;  however,  with  the  advent  of  inexpensive
microprocessors  on  the  one  hand,  and  custom  VLSI  chips  on  the  other  hand,  the
design  options  available  appear  to  be  growing  rapidly.  As  a  consequence, 
research  opportunities  should  be  directed  towards  exploring  those  options 
which  are  generally  not  being  pursued  actively  by  the  civilian  economy. 

Super  high  performance  digital  systems  can  be  achieved  in  a  number  of 
ways.  At  the  systems  and  digital  level  two  approaches  stand  out.  The  first 
is  to  develop  high  performance  systems  by  aggregating  commercially  available
processors  and  microprocessors  into  specialized  multiprocessors  suited  to
particular  applications.  Since  the  processors  must  communicate  with  each 
other  in  an  effective  manner,  the  design  of  the  processor  communications 
network  is  of  particular  importance.  This  is  the  basic  motivation  for  the 
research  work  described  in  section  1.0  of  this  review.  The  second  approach 
is  to  design  specialized  processors  by  developing  customized  VLSI  chips  or 
chip  sets  which  satisfy  the  requirements  of  limited  applications  groups.  In 
certain  situations  one  or  the  other  of  these  approaches  might  be  advantageous. 
The  key  point  however  is  that  methodologies  for  selecting  an  approach,  for 
predicting  performance  prior  to  implementation,  and  for  performing  design 
studies  and  selecting  among  design  alternatives  are  by  and  large  not  available. 

Thus,  just  when  the  possibilities  are  present  for  creative  design  resulting 
in  large  increases  in  performance,  the  methodologies  and  tools  for  properly
performing  such  design  and  for  managing  the  complexity  associated  with  this 
design  environment  are  absent.  This  is  an  area  where  research  opportunities 
exist,  and  where  the  long  term  payoff  could  be  considerable. 

Systems  complexity  is  increasing  at  every  level  of  the  design  cycle.
At  the  chip  level,  component  densities  on  the  order  of  a  million  or  more  are
likely  to  be  achieved  within  the  decade.  At  the  processor  level,  multiprocessor
systems  containing  a  thousand  or  more  processors  will  probably  be  attempted
in  the  same  general  time  frame.  Ad  hoc  design  methods  are  not  suitable  for
such  systems  and  more  research  into  structured  design  techniques  and
methodologies  is  needed.  One  area  of  research  requiring  more  work  relates  to  the
development  of  specification  languages  which  permit  the  designer  to  specify
and  document  systems  at  the  hardware,  software  and  interface  boundaries.
Such  specification  languages  should  be  highly  modular  and  structured,  and  should
be  based  on  a  hierarchical  view  of  systems.  This  hierarchical  view  is  essential  so  that
designers  at  different  levels  (e.g.,  functional,  timing/sequencing,  and  electrical)
can  develop  and  contribute  to  the  system  specification.

Such  specification  languages  represent  tools  not  only  for  documentation 
and  management  of  the  design  process,  but  for  conceptually  developing  models 
of  the  system  to  be  designed.  Such  models,  when  represented  in  a  specification 
language,  should  be  capable  of  automatically  generating  simulation  programs 
for  the  systems  specified.  This  is  critical  if  a  better  handle  on  system 
performance  is  to  be  achieved  prior  to  implementation  and  if  alternative 
designs  are  to  be  compared  in  a  quantitative  manner.
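
As  a  minimal  sketch  of  the  idea,  and  not  a  specification  language  in  any  real
sense,  the  fragment  below  describes  a  system  as  a  hierarchy  of  modules  and  derives
a  single  performance  figure  directly  from  that  description,  so  that  two  alternatives
can  be  compared  numerically.  The  class  name  Module,  the  series-delay  model,  and  all
delay  values  are  assumptions  made  for  illustration;  none  come  from  this  report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """One node in a hypothetical hierarchical system description."""
    name: str
    delay_ns: float = 0.0  # delay of a leaf element (assumed value)
    parts: List["Module"] = field(default_factory=list)  # submodules in series

    def total_delay(self) -> float:
        """Leaf: its own delay. Composite: sum over series-connected parts."""
        if not self.parts:
            return self.delay_ns
        return sum(p.total_delay() for p in self.parts)


# Two alternative data paths feeding the same adder (illustrative numbers only).
design_mux = Module("mux_path", parts=[Module("mux", 4.0), Module("adder", 20.0)])
design_bus = Module("bus_path", parts=[Module("bus", 10.0), Module("adder", 20.0)])

for design in (design_mux, design_bus):
    print(design.name, design.total_delay())
```

A  realistic  tool  would  have  to  model  timing,  contention  and  electrical  effects  at
each  level  of  the  hierarchy;  the  point  here  is  only  that  the  evaluation  is  derived
from  the  description  rather  than  written  by  hand  for  each  design.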

A  number  of  critical  research  questions  are  present  with  regard  to  such 
specification/simulation  systems.  For  instance  it  is  not  clear  just  how  to 
achieve  true  hierarchical  simulation  capabilities.  The  high  cost  associated
with  running  such  simulations  when  even  small  systems  are  investigated  is  of 
concern.  There  may  in  fact  be  a  need  for  certain  types  of  special  purpose 
computers  tailored  to  the  requirements  of  large  system  modeling  and  simulation. 
Finally,  the  costs  associated  with  generating  simulation  programs  are  very  high.
Research  into  automatic  generation  of  such  programs  from  system  specification 
languages  is  needed. 

Given  system  specification  and  simulation  capabilities  at  a  number  of 
levels  in  the  specification  hierarchy,  the  problem  of  automatic  design  can
be  attacked.  That  is,  having  settled  on  a  satisfactory  system  specification
at  one  level,  can  the  design,  or  system  specification  at  successively  lower 
levels  be  automatically  generated?  This  is  clearly  a  pressing  problem  in  the 
VLSI  domain.  The  current  conventional  wisdom  is  that  while  VLSI  capabilities 
will  be  exploited  in  the  design  of  regularly  structured  digital  systems
(e.g.,  memories,  programmable  logic  arrays),  it  may  not  be  possible  to  fully 
exploit  the  available  logic  densities  in  special  purpose,  nonregular  devices. 
The  reason  is  that  design  time  and  costs  go  up  drastically  when  one  begins 
to  design  single  chip  systems  whose  complexity  approaches  a  million  devices. 
Note  that  except  for  certain  high  volume  applications  such  as  microprocessors, 
there  is  relatively  little  commercial  incentive  to  design  at  this  leading 
edge  of  VLSI  capabilities.  In  many  situations,  packaging,  power  supply  and 
similar  costs  already  dominate  the  costs  of  the  logic  itself. 

High  performance,  specialized  military  systems,  however,  require  design 
at  this  leading  edge.  Thus  design  automation  systems  for  VLSI  are  of 
importance.  In  particular  more  research  should  be  directed  at  the  problems 
of  automatic  chip  design  from  system  specification  languages.  For  instance,
one  should  be  able  to  specify  a  system  in  a  register  transfer  language,  specify 
a  technology  and  its  key  parameters  (e.g.,  for  NMOS,  feature  size),  and 
automatically  generate  the  appropriate  masks  or  chip  layout  instructions.  Note 
that  research  is  needed  even  at  the  level  of  defining  just  what  represents  an 
acceptable  set  of  key  parameters.  At  the  level  of  automatic  chip  design  and 
layout  some  research  is  under  way;  however,  much  of  it  appears  to  be  preliminary
and  many  problems  remain.  For  instance,  assume  that  a  register  transfer 
language  is  used  to  specify  that  at  different  points  in  time  and  on  various 
conditions,  information  is  to  be  moved  from  each  of  several  different  registers 
and  input  pins  to  the  input  to  an  adder.  Other  data  transfers  involving  these
registers  and  inputs  must  also  be  performed  at  various  times  and  on  various 
conditions.  Question:  what  is  the  best  way  (and  layout)  of  setting  up 
communications  paths  between  these  logical  devices  (e.g.,  use  of  a  common 
bus,  use  of  multiplexers,  demultiplexers,  chip  wide  communications  network,
etc.)?
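
The  trade-off  raised  by  this  question  can  be  sketched  with  a  toy  resource  count.
The  fragment  below  takes  a  hypothetical  list  of  register  transfers  and  compares
a  single  shared  bus  with  one  multiplexer  per  destination.  The  transfer  list,
the  function  names,  and  the  cost  measures  (drivers,  taps,  multiplexer  inputs,
transfers  per  cycle)  are  all  assumptions  chosen  for  illustration;  real  layout
cost  depends  on  wiring  area  and  technology  parameters  that  are  not  modeled  here.

```python
from collections import defaultdict

# Hypothetical register transfers: (source, destination) pairs that the
# register transfer description requires at some time or on some condition.
TRANSFERS = [
    ("R1", "ADDER_A"), ("R2", "ADDER_A"), ("PIN0", "ADDER_A"),
    ("R1", "R3"), ("PIN0", "R3"),
    ("R2", "R1"),
]


def shared_bus_cost(transfers):
    """Single common bus: each distinct source needs a bus driver and each
    distinct destination a bus tap; only one transfer can occur per cycle."""
    sources = {s for s, _ in transfers}
    dests = {d for _, d in transfers}
    return {"drivers": len(sources), "taps": len(dests), "transfers_per_cycle": 1}


def mux_cost(transfers):
    """One multiplexer per destination, with one input per distinct source
    feeding it; transfers to different destinations may proceed in parallel."""
    fan_in = defaultdict(set)
    for s, d in transfers:
        fan_in[d].add(s)
    return {
        "muxes": len(fan_in),
        "mux_inputs": sum(len(srcs) for srcs in fan_in.values()),
        "transfers_per_cycle": len(fan_in),
    }


print("bus:", shared_bus_cost(TRANSFERS))
print("mux:", mux_cost(TRANSFERS))
```

Under  these  assumptions  the  bus  uses  fewer  connections  but  serializes  all
transfers,  while  the  multiplexer  arrangement  uses  more  connections  and  permits
parallel  transfers.  Which  is  preferable  on  a  given  chip  is  exactly  the  open
question  posed  above.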

Such  questions  can  often  be  answered  in  a  satisfactory  manner  when 
the  systems  involved  are  of  limited  complexity.  When  the  systems  involve 
millions  of  components  the  specification,  analysis  and  design  questions 
become  overwhelming.  These,  however,  are  the  sort  of  systems  we  will  want 
to  design  in  the  eighties,  and  the  military  in  particular  will  have  need  to 
design  such  systems  of  a  specialized  nature  to  achieve  its  performance  and 
reliability  goals.  To  solve  these  problems  of  the  middle  and  late  1980's  it 
is  imperative  that  appropriate  research  and  tool  development  programs  be 
initiated  now. 